Article

Speeding up Training of Linear Predictors for Multi-Antenna Frequency-Selective Channels via Meta-Learning

Department of Engineering, King’s College London, London WC2R 2LS, UK
*
Author to whom correspondence should be addressed.
Entropy 2022, 24(10), 1363; https://doi.org/10.3390/e24101363
Submission received: 3 August 2022 / Revised: 19 September 2022 / Accepted: 23 September 2022 / Published: 26 September 2022
(This article belongs to the Special Issue Wireless Networks: Information Theoretic Perspectives Ⅱ)

Abstract
An efficient data-driven prediction strategy for multi-antenna frequency-selective channels must operate based on a small number of pilot symbols. This paper proposes novel channel-prediction algorithms that address this goal by integrating transfer and meta-learning with a reduced-rank parametrization of the channel. The proposed methods optimize linear predictors by utilizing data from previous frames, which are generally characterized by distinct propagation characteristics, in order to enable fast training on the time slots of the current frame. The proposed predictors rely on a novel long short-term decomposition (LSTD) of the linear prediction model that leverages the disaggregation of the channel into long-term space-time signatures and fading amplitudes. We first develop predictors for single-antenna frequency-flat channels based on transfer/meta-learned quadratic regularization. Then, we introduce transfer and meta-learning algorithms for LSTD-based prediction models that build on equilibrium propagation (EP) and alternating least squares (ALS). Numerical results under the 3GPP 5G standard channel model demonstrate the impact of transfer and meta-learning on reducing the number of pilots for channel prediction, as well as the merits of the proposed LSTD parametrization.

1. Introduction

The capacity to accurately predict channel state information (CSI) is a key enabler of proactive resource allocation strategies, which are central to many visions for efficient and low-latency communications in 6G and beyond (see, e.g., [1]). The problem of channel prediction is relatively straightforward in the presence of known channel statistics. In fact, under the common assumption that multi-antenna frequency-selective channels follow stationary complex Gaussian processes, optimal channel predictors can be obtained via linear minimum mean squared error (LMMSE) estimators such as the Wiener filter [2]. However, in practice, the channel statistics are not known, and predictors need to be optimized based on training data obtained through the transmission of pilot signals [3,4,5,6,7,8]. The problem addressed by this paper concerns the design of data-efficient channel predictors for multi-antenna frequency-selective channels.

1.1. Context and Prior Art

A classical approach to tackle this problem is to optimize finite impulse response (FIR) filters [9], or recursive linear filters via autoregressive (AR) models [3,4,5] such as Kalman filtering (KF) [6,7,8], by estimating channel statistics from the available pilot data. Although recursive linear filters generally outperform FIR filters when an accurate model of the state-transition dynamics is available [10,11], FIR filters are typically advantageous in the presence of limited amounts of pilot data [12]. More recently, deep learning-based nonlinear predictors have also been proposed to adapt to channel statistics through the training of neural networks, namely recurrent neural networks [13,14,15,16], convolutional neural networks [17,18], and multi-layer perceptrons [19].
As reported in [14,15,16,19], deep learning-based predictors tend to require larger amounts of training (pilot) data, and they fail to outperform well-designed linear filters in the low-data regime. Solutions addressing this issue include [20], which applies reinforcement learning to decide whether to predict the channel at the current time, and the use of hypernetworks to adapt the parameters of a KF according to the current channel dynamics [12].
Most prior work, with the notable exception of [12], focuses on the optimization of channel predictors under the assumption of a stationary spatio-temporal correlation function across the time interval of interest. This conventional approach fails to leverage common structure that may exist across multiple frames, with each frame being characterized by distinct spatio-temporal correlations (see Figure 1). Reference [12] allowed for varying Doppler spectra across frames, through a deep learning-based hypernetwork that is used to adapt the parameters of a generative model [12].
This paper takes a different approach that allows us to move beyond the single-antenna setting studied in [12]. As described in the next subsection, the key ingredients of the proposed methods are transfer learning and meta-learning. Transfer learning [21] and meta-learning [22] aim at using knowledge from distinct tasks in order to reduce the data requirements on a new task of interest. Given a sufficiently strong resemblance among tasks, both transfer learning and meta-learning have proved remarkably effective at reducing sample complexity in general machine learning problems [23]. Transfer learning applies to a specific target task, whereas meta-learning caters to adaptation to any new task (see, e.g., [24]).
Previous applications of transfer learning to communication systems include beamforming for the multi-user, multiple-input, single-output (MISO) downlink [25] and for the intelligent reflecting surface (IRS)-assisted MISO downlink [26], as well as downlink channel prediction [27,28] (see also [25,27]). Meta-learning has been applied to communication systems including demodulation [29,30,31,32], decoding [33], end-to-end design of encoding and decoding with and without a channel model [34,35], MIMO detection [36], beamforming for multiuser MISO downlink systems [37], layered division multiplexing for ultra-reliable communications [38], UAV trajectory design [39], and resource allocation [40].

1.2. Contributions

This paper proposes novel, efficient, data-driven channel prediction algorithms that reduce pilot requirements by integrating transfer and meta-learning with a novel long short-term decomposition (LSTD) of the linear predictors. Unlike the prior articles reviewed above, the proposed methods apply to multi-antenna frequency-selective channels whose statistics change across frames (see Figure 1). The specific contributions are as follows.
  • We develop efficient predictors for single-antenna frequency-flat channels based on transfer/meta-learned quadratic regularization. Transfer and meta-learning are used to leverage data from multiple frames in order to extract shared useful knowledge that can be used for prediction on the current frame (see Figure 2).
  • Targeting multi-antenna frequency-selective channels, we introduce the LSTD-based model class of linear predictors that builds on the well-known disaggregation of standard channel models into long-term space-time signatures and fading amplitudes [5,41,42,43,44]. Accordingly, the channel is described by multipath features, such as angles of arrival, delays, and path losses, that change slowly across the frame, as well as by fast-varying fading amplitudes. Transfer learning and meta-learning algorithms for LSTD-based prediction models are proposed that build on equilibrium propagation (EP) and alternating least squares (ALS).
  • Numerical results under the 3GPP 5G standard channel model demonstrate the impact of transfer and meta-learning on reducing the number of pilots for channel prediction, as well as the merits of the proposed LSTD parametrization.
Part of this paper was presented in [45], which only covered meta-learning for the case of single-antenna frequency-flat channels. As compared to [45], this journal version includes both transfer and meta-learning, and it addresses the general scenario of multi-antenna frequency-selective channels by introducing and developing the LSTD model class of linear predictors.

1.3. Organization

The rest of the paper is organized as follows. In Section 2, we detail system and channel models, and describe conventional, transfer, and meta-learning concepts. In Section 3, we develop solutions for single-antenna frequency-flat channels. In Section 4, multi-antenna frequency-selective channels are considered, and we propose LSTD-based linear prediction schemes. Numerical results are presented in Section 5, and conclusions are presented in Section 6.
Notation: In this paper, $(\cdot)^\top$ denotes transposition; $(\cdot)^H$ Hermitian transposition; $\|\cdot\|_F$ the Frobenius norm; $|\cdot|$ the absolute value; $\|\cdot\|$ the Euclidean norm; $\mathrm{vec}(\cdot)$ the vectorization operator that stacks the columns of a matrix into a column vector; $[\cdot]_i$ the $i$-th element of a vector; and $\mathbf{I}_S$ the $S \times S$ identity matrix for some integer $S$.

2. System Model

2.1. System Model

As shown in Figure 1, we study a frame-based transmission system, with each frame containing multiple time slots. Each frame carries data from a possibly different user to the same receiver, e.g., a base station. The receiver has $N_R$ antennas, and the transmitters have $N_T$ antennas. The channel $\mathbf{h}_{l,f}$ in slot $l = 1, 2, \dots$ of frame $f = 1, 2, \dots$ is a vector with $S = N_R N_T W$ entries, with $W$ being the delay spread measured in number of transmission symbols. Within each frame $f$, the multi-path channels $\mathbf{h}_{l,f} \in \mathbb{C}^{N_R N_T W \times 1}$ are characterized by fixed, frame-dependent, average path powers, path delays, Doppler spectra, and angles of arrival and departure [46]. For instance, in one frame $f$, we may have a slow-moving user in line-of-sight condition subject to time-invariant fading, whereas in another, the channel may exhibit significant scattering and fast temporal variations with a large Doppler frequency. In both cases, the frame is assumed to be short enough that average path powers, path delays, Doppler spectra, and angles of arrival and departure do not change within the frame [41,42].
As also seen in Figure 1, for each frame $f$, we are interested in addressing the lag-$\delta$ channel prediction problem, in which channel $\mathbf{h}_{l+\delta,f}$ is predicted based on the $N$ past channels
$$\mathbf{H}_{l,f}^N = [\mathbf{h}_{l,f}, \dots, \mathbf{h}_{l-N+1,f}] \in \mathbb{C}^{S \times N}. \quad (1)$$
We adopt linear prediction with regressor $V_f \in \mathbb{C}^{SN \times S}$, so that the prediction is given as
$$\hat{\mathbf{h}}_{l+\delta,f} = V_f^H \mathrm{vec}(\mathbf{H}_{l,f}^N). \quad (2)$$
The focus on linear prediction is justified by the optimality of linear estimation for Gaussian stationary processes [47], which provide standard models for fading channels in rich scattering environments.
Assuming no prior knowledge of the channel model, we adopt a data-driven approach to the design of the predictor (2). Accordingly, to train the linear predictor (2), for any frame $f$, the receiver is assumed to have available the training set
$$\mathcal{Z}_f^{\rm tr} = \{(\mathbf{x}_{i,f}, \mathbf{y}_{i,f})\}_{i=1}^{L^{\rm tr}} \triangleq \{(\mathrm{vec}(\mathbf{H}_{l,f}^N), \mathbf{h}_{l+\delta,f})\}_{l=N}^{L^{\rm tr}+N-1} \quad (3)$$
encompassing $L^{\rm tr}$ input–output examples. Dataset $\mathcal{Z}_f^{\rm tr}$ can be constructed from the $L^{\rm tr} + N + \delta - 1$ channels $\{\mathbf{h}_{1,f}, \dots, \mathbf{h}_{L^{\rm tr}+N+\delta-1,f}\}$ by using the lag-$\delta$ channel $\mathbf{h}_{l+\delta,f}$ as label for the covariate vector $\mathrm{vec}(\mathbf{H}_{l,f}^N)$. In practice, the channel vectors $\mathbf{h}_{l,f}$ are estimated by using pilot symbols, and estimation noise can be easily incorporated in the model (see Section 2.5). Throughout, we implicitly assume that the channels $\mathbf{h}_{l,f}$ correspond to estimates available at the receiver.
From dataset $\mathcal{Z}_f^{\rm tr}$ in (3), we write the corresponding $L^{\rm tr} \times SN$ input matrix $X_f^{\rm tr} = [\mathbf{x}_{1,f}, \dots, \mathbf{x}_{L^{\rm tr},f}]^H$ and the $L^{\rm tr} \times S$ target matrix $Y_f^{\rm tr} = [\mathbf{y}_{1,f}, \dots, \mathbf{y}_{L^{\rm tr},f}]^H$, so that the dataset can be expressed as the pair $\mathcal{Z}_f^{\rm tr} = (X_f^{\rm tr}, Y_f^{\rm tr})$.
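To make the construction of (3) concrete, the following is a minimal NumPy sketch; the helper name build_dataset and the zero-based indexing convention are ours, not part of the paper. Rows of X and Y are Hermitian transposes of the covariates and labels, matching the least-squares forms used later.

```python
import numpy as np

def build_dataset(h, N, delta):
    """Build the training set (3) from a sequence of channel vectors.

    h     : array of shape (T, S) with channel estimates h_{1,f}, ..., h_{T,f}
    N     : window size (number of past channels per input)
    delta : prediction lag
    Returns X of shape (L, S*N) with rows vec(H^N_{l,f})^H and
            Y of shape (L, S) with rows h_{l+delta,f}^H.
    """
    T, S = h.shape
    L = T - N - delta + 1                             # number of input-output examples
    X = np.zeros((L, S * N), dtype=complex)
    Y = np.zeros((L, S), dtype=complex)
    for i, l in enumerate(range(N - 1, N - 1 + L)):   # l indexes the most recent slot
        H = h[l::-1][:N].T                            # S x N matrix [h_l, ..., h_{l-N+1}]
        X[i] = H.reshape(-1, order="F").conj()        # row = vec(H)^H
        Y[i] = h[l + delta].conj()                    # row = h_{l+delta}^H
    return X, Y
```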

2.2. Channel Model

We adopt the standard spatial channel model [46]. Accordingly, a channel vector $\mathbf{h}_{l,f}$ for slot $l$ in frame $f$ is obtained by sampling the continuous-time multipath vector channel impulse response
$$\mathbf{h}_{l,f}(\tau) = \sum_{d=1}^{D} \Omega_{d,f}^{1/2}\, \mathbf{a}_{d,f}\, g(\tau - \tau_{d,f}) \exp(j 2\pi \gamma_{d,f} t_l), \quad (4)$$
which is the sum of contributions from $D$ paths. In (4), the waveform $g(\tau)$ is given by the convolution of the transmitted waveform and the matched filter at the receiver. Furthermore, the contribution of the $d$-th path depends on the average power $\Omega_{d,f}$, the path delay $\tau_{d,f}$, the $N_T N_R \times 1$ spatial vector $\mathbf{a}_{d,f}$, the Doppler frequency $\gamma_{d,f}$, and the starting wall-clock time $t_l$ of the $l$-th slot. The average power $\Omega_{d,f}$, path delay $\tau_{d,f}$, spatial vector $\mathbf{a}_{d,f}$, and Doppler frequency $\gamma_{d,f}$ are constant within one frame because they depend on large-scale geometric features of the propagation environment. However, they may change over frames following Clause 7.6.3.2 (Procedure B) in [46]. The number of paths is assumed, without loss of generality, to be the same for all frames $f$, because one can set $\Omega_{d,f} = 0$ for frames with a smaller number of paths.
In [46], the spatial vector $\mathbf{a}_{d,f}$ has a structure that depends on the field patterns and steering vectors of the transmit and receive antennas, as well as on the polarization of the antennas. Mathematically, the entry of the spatial vector $\mathbf{a}_{d,f}$ corresponding to receive antenna element $n_R$ and transmit antenna element $n_T$ can be modeled as [46]
$$[\mathbf{a}_{d,f}]_{n_R + (n_T-1)N_R} = \mathbf{F}_{rx,n_R}(\theta_{d,f,ZOA}, \phi_{d,f,AOA})^\top \mathbf{M}_{d,f}\, \mathbf{F}_{tx,n_T}(\theta_{d,f,ZOD}, \phi_{d,f,AOD}) \exp\!\left(j \frac{2\pi\, l_{d,f,n_R,n_T}}{\lambda_0}\right), \quad (5)$$
where $\mathbf{F}_{rx,n_R}(\cdot,\cdot)$ and $\mathbf{F}_{tx,n_T}(\cdot,\cdot)$ are the $2 \times 1$ field patterns; $\theta_{d,f,ZOA}$, $\phi_{d,f,AOA}$, $\theta_{d,f,ZOD}$, and $\phi_{d,f,AOD}$ are the zenith angle of arrival (ZOA), azimuth angle of arrival (AOA), zenith angle of departure (ZOD), and azimuth angle of departure (AOD) (in degrees); $\lambda_0$ is the wavelength (in m) of the carrier frequency; $l_{d,f,n_R,n_T}$ is the length of the path (in m) between the two antennas; and $\mathbf{M}_{d,f}$ is the polarization coupling matrix defined as
$$\mathbf{M}_{d,f} = \begin{bmatrix} \exp(j\Phi_{d,f}^{\theta\theta}) & \sqrt{1/\kappa_{d,f}}\,\exp(j\Phi_{d,f}^{\theta\phi}) \\ \sqrt{1/\kappa_{d,f}}\,\exp(j\Phi_{d,f}^{\phi\theta}) & \exp(j\Phi_{d,f}^{\phi\phi}) \end{bmatrix}, \quad (6)$$
with random initial phases $\Phi_{d,f}^{(\cdot,\cdot)} \sim \mathcal{U}(-\pi, \pi)$ and log-normal distributed cross polarization power ratio (XPR) $\kappa_{d,f} > 0$ [46].
In order to obtain the $S \times 1$ vector $\mathbf{h}_{l,f}$, we sample the continuous-time channel $\mathbf{h}_{l,f}(\tau)$ in (4) at Nyquist rate $1/T$ to obtain the $W$ discrete-time $N_R N_T \times 1$ channel impulse responses
$$\mathbf{h}_{l,f}[w] = \mathbf{h}_{l,f}((w-1)T) \quad (7)$$
for $w = 1, \dots, W$. Following [41], the channel vector $\mathbf{h}_{l,f} \in \mathbb{C}^{N_R N_T W \times 1}$ is obtained by concatenating the $W$ channel vectors $\mathbf{h}_{l,f}[w]$ for $w = 1, \dots, W$ as
$$\mathbf{h}_{l,f} = [\mathbf{h}_{l,f}[1]^\top, \dots, \mathbf{h}_{l,f}[W]^\top]^\top. \quad (8)$$
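A toy instance of the sampling procedure (4), (7), and (8) is sketched below; the pulse shape, path statistics, and spatial vectors are illustrative placeholders rather than the 3GPP model of [46].

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_channel(t_l, D=3, NRNT=4, W=8, T=1.0):
    """Toy instance of the multipath model (4), sampled as in (7)-(8).

    Assumptions (illustrative only): a sinc pulse g standing in for the Tx/Rx
    filters, random powers/delays/Dopplers, i.i.d. Gaussian spatial vectors.
    Returns the S x 1 vector h_{l,f} of (8), with S = NRNT * W.
    """
    g = lambda tau: np.sinc(tau / T)                # Nyquist pulse (placeholder)
    omega = rng.exponential(1.0, D)                 # average path powers Omega_{d,f}
    tau_d = rng.uniform(0, W * T / 2, D)            # path delays tau_{d,f}
    gamma = rng.uniform(-0.1, 0.1, D)               # Doppler frequencies gamma_{d,f}
    a = (rng.standard_normal((D, NRNT)) + 1j * rng.standard_normal((D, NRNT))) / np.sqrt(2)
    h_taps = np.zeros((W, NRNT), dtype=complex)
    for w in range(W):                              # sample at tau = (w-1)T, cf. (7)
        for d in range(D):
            h_taps[w] += (np.sqrt(omega[d]) * a[d] * g(w * T - tau_d[d])
                          * np.exp(1j * 2 * np.pi * gamma[d] * t_l))
    return h_taps.reshape(-1)                       # concatenate taps as in (8)

h = sampled_channel(t_l=0.0)
print(h.shape)  # (32,)
```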

2.3. Conventional Learning

The optimization of the linear predictor $V_f$ in (2) can be formulated as a supervised learning problem, as will be detailed in Section 3. In conventional learning, the predictor $V_f$ is designed separately in each frame $f$ based on the corresponding dataset $\mathcal{Z}_f^{\rm tr}$. In order for this predictor $V_f$ to generalize well to slots in the same frame $f$ outside the training set, it is necessary to have a sufficiently large number of training slots $L^{\rm tr}$ [48,49].

2.4. Transfer Learning and Meta-Learning

In conventional learning, the number of required training slots $L^{\rm tr}$ can be reduced by selecting hyperparameters in the learning problem that reflect prior knowledge about the prediction problem at hand. In the next sections, we will explore solutions that optimize such hyperparameters based on data received from multiple previous frames. To this end, as illustrated in Figure 2, we assume the availability of channel data collected from $F$ frames received in the past. In each frame, the channel follows the model described in Section 2.2. Accordingly, the data from each previous frame $f$ consist of $L + N + \delta - 1$ channels $\{\mathbf{h}_{1,f}, \dots, \mathbf{h}_{L+N+\delta-1,f}\}$ for some integer $L$.
By using these channels, the dataset
$$\mathcal{Z}_f = \{(\mathbf{x}_{i,f}, \mathbf{y}_{i,f})\}_{i=1}^{L} \triangleq \{(\mathrm{vec}(\mathbf{H}_{l,f}^N), \mathbf{h}_{l+\delta,f})\}_{l=N}^{L+N-1} \quad (9)$$
can be obtained as explained in Section 2.1, where $L$ is typically larger than $L^{\rm tr}$, although this will not be assumed in the analysis. Correspondingly, we also define the $L \times SN$ input matrix $X_f$ and the $L \times S$ target matrix $Y_f$. We will propose methods that leverage the historical knowledge available from the datasets $\mathcal{Z}_f$ for $f = 1, \dots, F$ via transfer learning and meta-learning, with the goal of reducing the number of pilots, $L^{\rm tr}$, needed for channel prediction in a new frame (i.e., frame $F+1$ in Figure 2).

2.5. Incorporating Estimation Noise

Until now, we have assumed that the channel vectors $\mathbf{h}_{l,f}$ are available noiselessly to the predictor. In practice, channel information needs to be estimated via pilots. To elaborate on this point, let us assume the received signal model
$$\mathbf{y}_{l,f}^p[i] = \mathbf{h}_{l,f}\, x_{l,f}^p[i] + \mathbf{n}_{l,f}[i], \quad (10)$$
where $x_{l,f}^p[i]$ stands for the $i$-th transmitted pilot symbol in block $l$ of frame $f$; $\mathbf{y}_{l,f}^p[i]$ for the corresponding received signal; and $\mathbf{h}_{l,f}$ for the channel, with additive white complex Gaussian noise $\mathbf{n}_{l,f}[i] \sim \mathcal{CN}(\mathbf{0}, N_0 \mathbf{I}_S)$. Given an average energy constraint $\mathbb{E}[|x_{l,f}^p[i]|^2] = E_x$ for the training symbol, the average signal-to-noise ratio (SNR) is given as $E_x/N_0$. From (10), we can estimate the channel as
$$\check{\mathbf{h}}_{l,f} = \frac{\mathbf{y}_{l,f}^p[i]}{x_{l,f}^p[i]} = \mathbf{h}_{l,f} + \frac{\mathbf{n}_{l,f}[i]}{x_{l,f}^p[i]} = \mathbf{h}_{l,f} + \boldsymbol{\xi}, \quad (11)$$
which suffers from channel estimation noise $\boldsymbol{\xi} \sim \mathcal{CN}(\mathbf{0}, \mathrm{SNR}^{-1} \mathbf{I}_S)$. If $P$ training symbols are available in each block, the channel estimation noise can be reduced via averaging to variance $\mathrm{SNR}^{-1}/P$. The channels $\check{\mathbf{h}}_{l,f}$ can be used as training data in the schemes described in the previous subsections. More efficient channel estimation methods, including sparse Bayesian learning [50] and approximate message passing approaches [51], may further reduce the channel estimation noise.
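As a small numerical illustration of (10) and (11), the following sketch assumes unit-energy pilots, so that averaging over P pilots reduces the estimation-noise variance to SNR^{-1}/P; the helper name estimate_channel is ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_channel(h, snr_db=10.0, P=4):
    """Pilot-based channel estimate as in (10)-(11), averaged over P pilots.

    Illustrative sketch: unit-energy pilots x = 1, so SNR = 1/N0 and the
    averaged estimation noise has per-entry variance SNR^{-1} / P.
    """
    S = h.shape[0]
    N0 = 10 ** (-snr_db / 10)                        # noise power for E_x = 1
    noise = rng.standard_normal((P, S)) + 1j * rng.standard_normal((P, S))
    noise *= np.sqrt(N0 / 2)                         # CN(0, N0) entries
    estimates = h[None, :] + noise                   # one estimate per pilot, cf. (11)
    return estimates.mean(axis=0)                    # averaging reduces variance by P

h_true = np.ones(4, dtype=complex)
h_hat = estimate_channel(h_true)
print(np.mean(np.abs(h_hat - h_true) ** 2))          # approx (1/SNR)/P = 0.025
```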

3. Single-Antenna Frequency-Flat Channels

In this section, we propose transfer learning and meta-learning methods for single-antenna flat-fading channels, for which $S = 1$. Throughout this section, we write the prediction matrix $V_f \in \mathbb{C}^{SN \times S}$ in (2) as the vector $\mathbf{v}_f \in \mathbb{C}^{N \times 1}$, and the target data $Y_f^{\rm tr} \in \mathbb{C}^{L^{\rm tr} \times S}$ as the vector $\mathbf{y}_f^{\rm tr} \in \mathbb{C}^{L^{\rm tr} \times 1}$. Correspondingly, we rewrite the linear predictor (2) as
$$\hat{h}_{l+\delta,f} = \mathbf{v}_f^H \mathrm{vec}(\mathbf{H}_{l,f}^N). \quad (12)$$

3.1. Conventional Learning

Assuming the standard quadratic loss, we formulate the supervised learning problem as the ridge regression optimization
$$\mathbf{v}^*(\mathcal{Z}_f^{\rm tr} \mid \bar{\mathbf{v}}) = \arg\min_{\mathbf{v}_f \in \mathbb{C}^{N \times 1}} \|X_f^{\rm tr} \mathbf{v}_f - \mathbf{y}_f^{\rm tr}\|^2 + \lambda \|\mathbf{v}_f - \bar{\mathbf{v}}\|^2, \quad (13)$$
with hyperparameters $(\lambda, \bar{\mathbf{v}})$ given by the scalar $\lambda > 0$ and by the $N \times 1$ bias vector $\bar{\mathbf{v}}$. The bias vector $\bar{\mathbf{v}}$ can be thought of as defining the prior mean of the predictor $\mathbf{v}_f$, whereas $\lambda > 0$ specifies the precision (i.e., inverse variance) of this prior knowledge. The solution of problem (13) can be obtained explicitly as
$$\mathbf{v}^*(\mathcal{Z}_f^{\rm tr} \mid \bar{\mathbf{v}}) = (A_f^{\rm tr})^{-1}\left((X_f^{\rm tr})^H \mathbf{y}_f^{\rm tr} + \lambda \bar{\mathbf{v}}\right), \quad \text{with } A_f^{\rm tr} = (X_f^{\rm tr})^H X_f^{\rm tr} + \lambda \mathbf{I}. \quad (14)$$
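To make the closed form (14) concrete, the following is a minimal NumPy sketch; the function name ridge_predictor and the zero default for the bias vector are our choices, not part of the paper.

```python
import numpy as np

def ridge_predictor(X, y, lam=1.0, v_bar=None):
    """Closed-form solution (14) of the ridge regression problem (13).

    X     : (L, N) complex input matrix (rows arranged as in Section 2.1)
    y     : (L,) complex target vector
    v_bar : (N,) bias vector (prior mean); defaults to zero, as in
            conventional learning without prior knowledge.
    """
    L, N = X.shape
    if v_bar is None:
        v_bar = np.zeros(N, dtype=complex)
    A = X.conj().T @ X + lam * np.eye(N)            # A_f^tr in (14)
    return np.linalg.solve(A, X.conj().T @ y + lam * v_bar)
```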

3.2. Transfer Learning

Transfer learning uses the datasets $\mathcal{Z}_f$ in (9) from the previous $F$ frames, i.e., with $f = 1, \dots, F$, to optimize the hyperparameter vector $\bar{\mathbf{v}}$ in (13) as
$$\bar{\mathbf{v}}^{\rm trans} = \arg\min_{\mathbf{v} \in \mathbb{C}^{N \times 1}} \sum_{f=1}^{F} \|X_f \mathbf{v} - \mathbf{y}_f\|^2. \quad (15)$$
The rationale for this choice is that the vector $\bar{\mathbf{v}}^{\rm trans}$ provides a useful prior mean to be used in the ridge regression problem (13), because it corresponds to an optimized predictor for the previous frames. Having optimized the bias vector $\bar{\mathbf{v}}^{\rm trans}$, we train a channel predictor via ridge regression (13) by using the training data $\mathcal{Z}_{f^{\rm new}}$ for a new frame $f^{\rm new}$ with $L^{\rm new}$ training samples, obtaining
$$\mathbf{v}_{f^{\rm new}}^* = \mathbf{v}^*(\mathcal{Z}_{f^{\rm new}} \mid \bar{\mathbf{v}}^{\rm trans}). \quad (16)$$
Note that, at deployment time, this approach has the same computational complexity as conventional learning, because the bias vector is treated as a constant.
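As a companion to (15) and (16), a minimal sketch of the transfer-learning step is given below; stacking the per-frame matrices and solving a single least-squares problem is equivalent to minimizing the sum in (15). The helper name transfer_bias is ours.

```python
import numpy as np

def transfer_bias(X_list, y_list):
    """Least-squares bias vector (15) from the F previous frames.

    X_list, y_list : per-frame input matrices and target vectors.
    Stacking frames row-wise turns the sum of per-frame losses in (15)
    into one ordinary least-squares problem.
    """
    X = np.vstack(X_list)
    y = np.concatenate(y_list)
    return np.linalg.lstsq(X, y, rcond=None)[0]

# usage, with ridge_predictor from the previous sketch:
# v_bar = transfer_bias(X_frames, y_frames)
# v_new = ridge_predictor(X_new, y_new, lam=1.0, v_bar=v_bar)   # cf. (16)
```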

3.3. Meta-Learning

Unlike transfer learning, which utilizes all the available datasets $\{\mathcal{Z}_f\}_{f=1}^F$ from the previous frames at once, as in (15), meta-learning allows for the separate adaptation of the predictor in each frame. To this end, for each frame $f$, we split the $L$ data points into $L^{\rm tr}$ training pairs $\{(\mathbf{x}_{i,f}, y_{i,f})\}_{i=1}^{L^{\rm tr}} \triangleq \{(\mathbf{x}_{i,f}^{\rm tr}, y_{i,f}^{\rm tr})\}_{i=1}^{L^{\rm tr}} = \mathcal{Z}_f^{\rm tr}$ and $L^{\rm te} = L - L^{\rm tr}$ test pairs $\{(\mathbf{x}_{i,f}, y_{i,f})\}_{i=L^{\rm tr}+1}^{L} \triangleq \{(\mathbf{x}_{i,f}^{\rm te}, y_{i,f}^{\rm te})\}_{i=1}^{L^{\rm te}} = \mathcal{Z}_f^{\rm te}$, resulting in two separate datasets, $\mathcal{Z}_f^{\rm tr}$ and $\mathcal{Z}_f^{\rm te}$. We correspondingly define the $L^{\rm tr} \times N$ input matrix $X_f^{\rm tr}$ and the $L^{\rm tr} \times 1$ target vector $\mathbf{y}_f^{\rm tr}$, as well as the $L^{\rm te} \times N$ input matrix $X_f^{\rm te}$ and the $L^{\rm te} \times 1$ target vector $\mathbf{y}_f^{\rm te}$.
The hyperparameter vector $\bar{\mathbf{v}}$ is then optimized by minimizing the sum loss of the predictors $\mathbf{v}^*(\mathcal{Z}_f^{\rm tr} \mid \bar{\mathbf{v}})$ in (13), which are adapted separately for each frame $f = 1, \dots, F$ given the bias vector $\bar{\mathbf{v}}$. Accordingly, estimating the loss in each frame $f$ via the test set $\mathcal{Z}_f^{\rm te}$ yields the meta-learning problem
$$\bar{\mathbf{v}}^{\rm meta} = \arg\min_{\bar{\mathbf{v}} \in \mathbb{C}^{N \times 1}} \sum_{f=1}^{F} \sum_{i=1}^{L^{\rm te}} \left| \mathbf{v}^*(\mathcal{Z}_f^{\rm tr} \mid \bar{\mathbf{v}})^H \mathbf{x}_{i,f}^{\rm te} - y_{i,f}^{\rm te} \right|^2. \quad (17)$$
As studied in [52], the minimization in (17) is a least squares problem that can be solved in closed form as
$$\bar{\mathbf{v}}^{\rm meta} = \arg\min_{\bar{\mathbf{v}} \in \mathbb{C}^{N \times 1}} \sum_{f=1}^{F} \|\tilde{X}_f^{\rm te} \bar{\mathbf{v}} - \tilde{\mathbf{y}}_f^{\rm te}\|^2 = (\tilde{X}^H \tilde{X})^{-1} \tilde{X}^H \tilde{\mathbf{y}}, \quad (18)$$
where the $L^{\rm te} \times N$ matrix $\tilde{X}_f^{\rm te}$ contains as rows the Hermitian transposes of the $N \times 1$ pre-conditioned input vectors $\{\lambda (A_f^{\rm tr})^{-1} \mathbf{x}_{i,f}^{\rm te}\}_{i=1}^{L^{\rm te}}$, with $A_f^{\rm tr} = (X_f^{\rm tr})^H X_f^{\rm tr} + \lambda \mathbf{I}$; the $L^{\rm te} \times 1$ vector $\tilde{\mathbf{y}}_f^{\rm te}$ contains, stacked vertically, the complex conjugates of the transformed outputs $\{y_{i,f}^{\rm te} - (\mathbf{y}_f^{\rm tr})^H X_f^{\rm tr} (A_f^{\rm tr})^{-1} \mathbf{x}_{i,f}^{\rm te}\}_{i=1}^{L^{\rm te}}$; the $F L^{\rm te} \times N$ matrix $\tilde{X} = [(\tilde{X}_1^{\rm te})^\top, \dots, (\tilde{X}_F^{\rm te})^\top]^\top$ stacks vertically the matrices $\{\tilde{X}_f^{\rm te}\}_{f=1}^F$; and the $F L^{\rm te} \times 1$ vector $\tilde{\mathbf{y}} = [(\tilde{\mathbf{y}}_1^{\rm te})^\top, \dots, (\tilde{\mathbf{y}}_F^{\rm te})^\top]^\top$ stacks vertically the vectors $\{\tilde{\mathbf{y}}_f^{\rm te}\}_{f=1}^F$. Unlike the standard meta-learning algorithms used by most papers on communications [25,29,30,32,33,34], the proposed meta-learning procedure adopts linear models, significantly reducing the computational complexity of meta-learning [52].
After meta-learning, similar to transfer learning, based on the meta-learned hyperparameter $\bar{\mathbf{v}}^{\rm meta}$, we train a channel predictor via ridge regression (13), obtaining
$$\mathbf{v}_{f^{\rm new}}^* = \mathbf{v}^*(\mathcal{Z}_{f^{\rm new}} \mid \bar{\mathbf{v}}^{\rm meta}). \quad (19)$$
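The closed form (18) can be sketched as follows, assuming per-frame datasets stored as NumPy arrays; the helper name meta_bias and the data layout are our assumptions.

```python
import numpy as np

def meta_bias(frames, lam=1.0):
    """Closed-form meta-learned bias vector (18).

    frames : list of tuples (X_tr, y_tr, X_te, y_te), one per past frame.
    Builds the pre-conditioned inputs lam * A^{-1} x and the transformed
    targets of (18), then solves a single least-squares problem for v_bar.
    """
    rows, targets = [], []
    for X_tr, y_tr, X_te, y_te in frames:
        N = X_tr.shape[1]
        A = X_tr.conj().T @ X_tr + lam * np.eye(N)    # A_f^tr
        A_inv = np.linalg.inv(A)
        for x, y in zip(X_te, y_te):
            z = A_inv @ x                              # A^{-1} x_{i,f}^te
            rows.append(lam * z.conj())                # row of X_tilde: (lam A^{-1} x)^H
            targets.append(np.conj(y - y_tr.conj() @ (X_tr @ z)))
    X_tilde = np.array(rows)
    y_tilde = np.array(targets)
    return np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]

# usage: v_bar = meta_bias(past_frames); then adapt via ridge regression (13)
# on the new frame, as in (19).
```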

4. Multi-Antenna Frequency-Selective Channels

In this section, we study the more general scenario with any number of antennas and with frequency-selective channels, resulting in S > 1 . As we will discuss, a naïve extension of the techniques presented in the previous sections is undesirable, because this would not leverage the structure of the channel model (4). For this reason, in the following, we will introduce novel hybrid model- and data-driven solutions that build on the channel model (4).

4.1. Naïve Extension

We start by briefly presenting the direct extension of the approaches studied in the previous section to any $S > 1$. Unlike the previous section, we adopt the general matrix notation introduced in Section 2. First, with $S = 1$, conventional learning obtains the predictor by solving problem (13), which is generalized to any $S > 1$ as the minimization
$$V^*(\mathcal{Z}_f^{\rm tr} \mid \bar{V}) = \arg\min_{V_f \in \mathbb{C}^{SN \times S}} \|X_f^{\rm tr} V_f - Y_f^{\rm tr}\|_F^2 + \lambda \|V_f - \bar{V}\|_F^2 \quad (20)$$
over the linear prediction matrix $V_f$ in (2). Similarly, transfer learning computes the bias matrix $\bar{V}^{\rm trans}$ by solving the following generalization of problem (15),
$$\bar{V}^{\rm trans} = \arg\min_{V \in \mathbb{C}^{SN \times S}} \sum_{f=1}^{F} \|X_f V - Y_f\|_F^2, \quad (21)$$
followed by the evaluation of the predictor $V^*(\mathcal{Z}_f^{\rm tr} \mid \bar{V}^{\rm trans})$ using (20); whereas meta-learning addresses the following generalization of the minimization (17),
$$\bar{V}^{\rm meta} = \arg\min_{\bar{V} \in \mathbb{C}^{SN \times S}} \sum_{f=1}^{F} \sum_{i=1}^{L^{\rm te}} \left\| V^*(\mathcal{Z}_f^{\rm tr} \mid \bar{V})^H \mathbf{x}_{i,f}^{\rm te} - \mathbf{y}_{i,f}^{\rm te} \right\|^2, \quad (22)$$
over the bias matrix $\bar{V} \in \mathbb{C}^{SN \times S}$, which is used to compute the predictor $V^*(\mathcal{Z}_f^{\rm tr} \mid \bar{V}^{\rm meta})$ in (20).
The issue with the naïve extensions (21) and (22) is that the dimensions of the predictor $V$ and of the hyperparameter matrix $\bar{V}$ can become extremely large as $S$ grows. This, in turn, may lead to overfitting in the hyperparameter space [53] when the number of frames, $F$, is limited. This form of overfitting may prevent transfer learning and meta-learning from effectively reducing the sample complexity of problem (20), because the optimized hyperparameter matrix $\bar{V}$ would be excessively dependent on the data received in the $F$ previous frames. To address this problem, we propose next to utilize the structure of the channel model (4) in order to reduce the dimension of the channel parametrization.

4.2. LSTD Channel Model

The channel model (4) implies that the channel vector $\mathbf{h}_{l,f}$ in (7) and (8) can be written as the product of a frame-dependent $N_R N_T W \times D$ matrix $T_f$ and a slot-dependent $D \times 1$ vector $\boldsymbol{\beta}_{l,f}$, as in [41],
$$\mathbf{h}_{l,f} = T_f \boldsymbol{\beta}_{l,f}, \quad (23)$$
where $T_f$ collects the space-time signatures of the $D$ paths as
$$T_f = [\Omega_{1,f}^{1/2}\, \mathbf{g}(\tau_{1,f}) \otimes \mathrm{vec}(\mathbf{a}_{1,f}), \dots, \Omega_{D,f}^{1/2}\, \mathbf{g}(\tau_{D,f}) \otimes \mathrm{vec}(\mathbf{a}_{D,f})], \quad (24)$$
with $\mathbf{g}(\tau_{d,f}) = [g(-\tau_{d,f}), \dots, g((W-1)T - \tau_{d,f})]^\top$ being the $W \times 1$ vector that collects the Nyquist-rate samples of the delayed waveform $g(\tau - \tau_{d,f})$, and the $D \times 1$ fading amplitude vector being defined as $\boldsymbol{\beta}_{l,f} = [\exp(j\omega_{1,f} t_l), \dots, \exp(j\omega_{D,f} t_l)]^\top$ with $\omega_{d,f} = 2\pi\gamma_{d,f}$.
The frame-dependent matrix $T_f$ is typically rank-deficient, because the paths are generally not all resolvable [54,55]. To account for this structural property of the channel, as in [41], we introduce an $N_R N_T W \times K$ full-rank unitary matrix $B_f$ such that $\mathrm{span}\{T_f\} = \mathrm{span}\{B_f\}$, and redefine (23) as
$$\mathbf{h}_{l,f} = B_f \mathbf{d}_{l,f}. \quad (25)$$
As an example, the unitary matrix $B_f$ can be obtained from the singular value decomposition of matrix $T_f$, i.e., $T_f = B_f \Lambda_f^{1/2} U_f^H$, by introducing the $K \times 1$ vector $\mathbf{d}_{l,f} = \Lambda_f^{1/2} U_f^H \boldsymbol{\beta}_{l,f}$ [41]. For future reference, we also rewrite (25) as
$$\mathbf{h}_{l,f} = \sum_{k=1}^{K} \mathbf{b}_f^k d_{l,f}^k, \quad (26)$$
where $d_{l,f}^k$ is the $k$-th element of the vector $\mathbf{d}_{l,f}$ and $\mathbf{b}_f^k$ is the $k$-th column of the matrix $B_f$.
We will refer to the matrix $B_f$ in (26) as the long-term space-time feature matrix, or feature matrix for short, whereas the vector $\mathbf{d}_{l,f}$ will be referred to as the short-term amplitude vector. The parametrizations (25) and (26) are particularly efficient when the feature matrix $B_f$ can be accurately estimated from the available data. For conventional learning, this requires observing a sufficiently large number of slots per frame, i.e., a large $L^{\rm new}$ [41], as well as a channel that varies sufficiently quickly across each frame. In contrast, as we will explore, transfer and meta-learning can potentially leverage data from multiple frames in order to enhance the estimation of the feature matrix.

4.3. LSTD-Based Prediction Model

Given the LSTD channel model (25) and (26), in this subsection we redefine the problem of predicting the channel $\mathbf{h}_{l+\delta,f} = B_f \mathbf{d}_{l+\delta,f}$ as the problem of estimating the feature matrix $B_f$ and predicting the amplitude vector $\mathbf{d}_{l+\delta,f}$ based on the available data. This will lead to a reduced-rank parametrization of the linear predictor (2).
To start, we write the predicted channel $\hat{\mathbf{h}}_{l+\delta,f}$ as
$$\hat{\mathbf{h}}_{l+\delta,f} = \hat{B}_f \hat{\mathbf{d}}_{l+\delta,f}, \quad (27)$$
where $\hat{B}_f$ and $\hat{\mathbf{d}}_{l+\delta,f}$ are the estimated feature matrix and the predicted amplitude vector, respectively. To define the corresponding predictor, we first observe that the input matrix $\mathbf{H}_{l,f}^N$ in (1) can be expressed by using (25) as
$$\mathbf{H}_{l,f}^N = B_f [\mathbf{d}_{l,f}, \dots, \mathbf{d}_{l-N+1,f}]. \quad (28)$$
Assume now that we have an estimated feature matrix $\hat{B}_f$. If this estimate is sufficiently accurate, the $N$ past amplitudes $[\mathbf{d}_{l,f}, \dots, \mathbf{d}_{l-N+1,f}] \in \mathbb{C}^{K \times N}$ can, in turn, be estimated from $\mathbf{H}_{l,f}^N$ as
$$[\hat{\mathbf{d}}_{l,f}, \dots, \hat{\mathbf{d}}_{l-N+1,f}] = \hat{B}_f^H \mathbf{H}_{l,f}^N. \quad (29)$$
Consider now the prediction of the $k$-th amplitude $d_{l+\delta,f}^k$. Generalizing (12), we adopt the linear predictor
$$\hat{d}_{l+\delta,f}^k = (\mathbf{v}_f^k)^H \mathrm{vec}([\hat{d}_{l,f}^k, \dots, \hat{d}_{l-N+1,f}^k]), \quad (30)$$
where $\mathbf{v}_f^k$ is an $N \times 1$ prediction vector, and
$$[\hat{d}_{l,f}^k, \dots, \hat{d}_{l-N+1,f}^k] = (\hat{\mathbf{b}}_f^k)^H \mathbf{H}_{l,f}^N \in \mathbb{C}^{1 \times N} \quad (31)$$
is the $k$-th row of the matrix (29), which represents the past $N$ scalar fading amplitudes corresponding to the $k$-th feature $\mathbf{b}_f^k$. Plugging the prediction (30) into (27) yields the predicted channel (cf. (26))
$$\hat{\mathbf{h}}_{l+\delta,f} = \sum_{k=1}^{K} \hat{\mathbf{b}}_f^k \hat{d}_{l+\delta,f}^k. \quad (32)$$
As detailed in Appendix A, by inserting (30) and (31) into (32), we can express the LSTD-based prediction (32) in the form (2) as
$$\hat{\mathbf{h}}_{l+\delta,f} = (V_f^{(K)})^H \mathrm{vec}(\mathbf{H}_{l,f}^N), \quad (33)$$
where the LSTD-based predictor matrix $V_f^{(K)} \in \mathbb{C}^{SN \times S}$ is given as
$$V_f^{(K)} = \sum_{k=1}^{K} \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H), \quad (34)$$
where $\otimes$ denotes the Kronecker product. The overall LSTD-based channel prediction scheme is illustrated in Figure 3.
The LSTD prediction model (33) reduces the number of learnable parameters from $S^2 N$ (for $V_f$) to $(S + N)K$ (for $V_f^{(K)}$). This complexity reduction comes at a minimal cost in terms of bias as long as the total number of features $K$ is chosen accurately (a detailed discussion can be found in Section 4.7) and the correlations across the amplitudes $d_{l,f}^k$ for different features $k = 1, \dots, K$ are negligible.
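As an illustration of the parametrization (33) and (34), the following sketch assembles $V_f^{(K)}$ from given feature and prediction vectors via Kronecker products; the helper names lstd_predictor_matrix and lstd_predict are ours.

```python
import numpy as np

def lstd_predictor_matrix(v_list, b_list):
    """Assemble the reduced-rank predictor V^(K) of (34).

    v_list : K prediction vectors v_f^k, each of shape (N,)
    b_list : K unit-norm feature vectors b_f^k, each of shape (S,)
    Returns V of shape (S*N, S), used as in (33): h_hat = V^H vec(H).
    """
    K = len(v_list)
    N, S = v_list[0].shape[0], b_list[0].shape[0]
    V = np.zeros((S * N, S), dtype=complex)
    for k in range(K):
        P = np.outer(b_list[k], b_list[k].conj())   # rank-1 projector b (b)^H
        V += np.kron(v_list[k][:, None], P)         # v^k  (x)  b^k (b^k)^H
    return V

def lstd_predict(V, H):
    """Prediction (33): H is the S x N matrix of the N past channels."""
    return V.conj().T @ H.reshape(-1, order="F")
```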

4.4. Conventional Learning for LSTD-Based Prediction

In conventional learning, the goal is to optimize the LSTD-based predictor $V_f^{(K)}$ by optimizing the feature matrix $\hat{B}_f$ and the feature-wise predictors $\{\mathbf{v}_f^k\}_{k=1}^K$ based on the available training dataset $\mathcal{Z}_f^{\rm tr}$. Substituting $V_f$ with $V_f^{(K)}$ defined in (34) into the naïve extension of conventional learning in (20) yields the problem
$$V^{(K),*}(\mathcal{Z}_f^{\rm tr} \mid \bar{V}^{(K)}) = \arg\min_{\substack{\hat{B}_f, \mathbf{v}_f^1, \dots, \mathbf{v}_f^K \\ V_f^{(K)} = \sum_{k=1}^K \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H)}} \|X_f^{\rm tr} V_f^{(K)} - Y_f^{\rm tr}\|_F^2 + \lambda \|V_f^{(K)} - \bar{V}^{(K)}\|_F^2, \quad \text{subject to } \hat{B}_f^H \hat{B}_f = \mathbf{I}_K, \quad (35)$$
over the optimization variables $(\hat{B}_f, \{\mathbf{v}_f^k\}_{k=1}^K)$. In (35), the hyperparameters $(\lambda, \bar{V}^{(K)})$ are given by the scalar $\lambda > 0$ and by the $SN \times S$ LSTD-based bias matrix $\bar{V}^{(K)}$ defined as (cf. (34))
$$\bar{V}^{(K)} = \sum_{k=1}^{K} \bar{\mathbf{v}}^k \otimes (\bar{\mathbf{b}}^k (\bar{\mathbf{b}}^k)^H). \quad (36)$$
Because the Euclidean norm regularization $\|V_f^{(K)} - \bar{V}^{(K)}\|_F^2$ in (35) mixes long-term and short-term dependencies through (34) and (36), we propose the following modification of problem (35):
$$V^{(K),*}(\mathcal{Z}_f^{\rm tr} \mid \{\bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k\}_{k=1}^K) = \arg\min_{\substack{\hat{B}_f, \mathbf{v}_f^1, \dots, \mathbf{v}_f^K \\ V_f^{(K)} = \sum_{k=1}^K \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H)}} \Big\{ \|X_f^{\rm tr} V_f^{(K)} - Y_f^{\rm tr}\|_F^2 + \lambda_2 \sum_{k=1}^K \|\mathbf{v}_f^k - \bar{\mathbf{v}}^k\|^2 - \lambda_1 \sum_{k=1}^K \mathrm{tr}\big((\hat{\mathbf{b}}_f^k)^H (\bar{\mathbf{b}}^k (\bar{\mathbf{b}}^k)^H) \hat{\mathbf{b}}_f^k\big) \Big\}, \quad \text{subject to } \hat{B}_f^H \hat{B}_f = \mathbf{I}_K, \quad (37)$$
with hyperparameters $(\lambda_1, \lambda_2, \bar{\mathbf{b}}^1, \dots, \bar{\mathbf{b}}^K, \bar{\mathbf{v}}^1, \dots, \bar{\mathbf{v}}^K)$ given by the scalars $\lambda_1, \lambda_2 > 0$, by the $S \times 1$ long-term bias vectors $\bar{\mathbf{b}}^1, \dots, \bar{\mathbf{b}}^K$, and by the $N \times 1$ short-term bias vectors $\bar{\mathbf{v}}^1, \dots, \bar{\mathbf{v}}^K$. For each feature $k$, the considered regularization minimizes the Euclidean distance between the short-term prediction vector $\mathbf{v}_f^k$ and the short-term bias vector $\bar{\mathbf{v}}^k$, as in Section 3, while maximizing the alignment between the long-term feature vector $\hat{\mathbf{b}}_f^k$ and the long-term bias vector $\bar{\mathbf{b}}^k$ in a manner akin to the kernel alignment method of [56].
To address problem (37), inspired by [57,58], we propose a sequential approach, in which the pair $(\mathbf{v}_f^k, \hat{\mathbf{b}}_f^k)$, consisting of the $k$-th predictor $\mathbf{v}_f^k$ and the $k$-th feature vector $\hat{\mathbf{b}}_f^k$, is optimized in the order $k = 1, 2, \dots, K$. Specifically, at each step $k$, we consider the problem
$$\hat{\mathbf{b}}_f^{k,*}, \mathbf{v}_f^{k,*} = \arg\min_{\substack{\hat{\mathbf{b}}_f^k, \mathbf{v}_f^k \\ (V_f^{(K)})^k = \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H)}} \Big\{ \|X_f^{\rm tr} (V_f^{(K)})^k - (Y_f^{\rm tr})^k\|_F^2 - \lambda_1 \mathrm{tr}\big((\hat{\mathbf{b}}_f^k)^H (\bar{\mathbf{b}}^k (\bar{\mathbf{b}}^k)^H) \hat{\mathbf{b}}_f^k\big) + \lambda_2 \|\mathbf{v}_f^k - \bar{\mathbf{v}}^k\|^2 \Big\}, \quad \text{subject to } (\hat{\mathbf{b}}_f^k)^H \hat{\mathbf{b}}_f^k = 1, \quad (38)$$
where the $L^{\rm tr} \times S$ $k$-th residual target matrix $(Y_f^{\rm tr})^k$ is defined as [57,58]
$$(Y_f^{\rm tr})^k = \begin{cases} Y_f^{\rm tr}, & \text{for } k = 1, \\ Y_f^{\rm tr} - \sum_{k'=1}^{k-1} X_f^{\rm tr} (V_f^{(K)})^{k',*}, & \text{for } k > 1, \end{cases} \quad (39)$$
given the $k$-th predictor
$$(V_f^{(K)})^k = \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H) \quad (40)$$
and the $k$-th optimized predictor
$$(V_f^{(K)})^{k,*} = \mathbf{v}_f^{k,*} \otimes (\hat{\mathbf{b}}_f^{k,*} (\hat{\mathbf{b}}_f^{k,*})^H). \quad (41)$$
Because (38) is a nonconvex problem, we use alternating least squares (ALS) [59] to obtain a solution $\{\hat{\mathbf{b}}_f^{k,*}, \mathbf{v}_f^{k,*}\}$ by iterating between the following two steps: (i) for a fixed $\hat{\mathbf{b}}_f^k$, update $\mathbf{v}_f^k$ as
$$\mathbf{v}_f^k \leftarrow \arg\min_{\substack{\mathbf{v}_f^k \\ (V_f^{(K)})^k = \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H)}} \|X_f^{\rm tr} (V_f^{(K)})^k - (Y_f^{\rm tr})^k\|_F^2 + \lambda_2 \|\mathbf{v}_f^k - \bar{\mathbf{v}}^k\|^2; \quad (42)$$
and (ii) for a fixed $\mathbf{v}_f^k$, update $\hat{\mathbf{b}}_f^k$ as
$$\hat{\mathbf{b}}_f^k \leftarrow \arg\min_{\substack{\hat{\mathbf{b}}_f^k \\ (V_f^{(K)})^k = \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H)}} \Big\{ \|X_f^{\rm tr} (V_f^{(K)})^k - (Y_f^{\rm tr})^k\|_F^2 - \lambda_1 \mathrm{tr}\big((\hat{\mathbf{b}}_f^k)^H (\bar{\mathbf{b}}^k (\bar{\mathbf{b}}^k)^H) \hat{\mathbf{b}}_f^k\big) \Big\}, \quad \text{subject to } (\hat{\mathbf{b}}_f^k)^H \hat{\mathbf{b}}_f^k = 1, \quad (43)$$
until convergence. Closed-form solutions for (42) and (43) can be found in Appendix B, and the overall LSTD-based conventional learning scheme is summarized in Algorithm 1.
Algorithm 1: LSTD-based conventional learning for channel prediction for $S \geq 1$
[Algorithm listing available as an image in the published article.]
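For concreteness, a single ALS iteration implementing the closed forms (A2)–(A5) of Appendix B might look as follows; this is a sketch under our dataset conventions (rows of the target matrix are Hermitian transposes of the labels), not the authors' reference implementation.

```python
import numpy as np

def als_step(H_list, Ytr, b, v_bar, b_bar, lam1=1.0, lam2=1.0):
    """One ALS iteration (42)-(43) for the k-th feature, via Appendix B.

    H_list : L matrices of shape (S, N), the past-channel windows H_{i,f}^N
    Ytr    : (L, S) residual target matrix, rows y_i^H (cf. (39))
    b      : current feature iterate (initialized to the hyperparameter b_bar)
    """
    N = H_list[0].shape[1]
    P = np.outer(b, b.conj())                              # rank-1 projector b b^H
    targets = [y.conj() for y in Ytr]                      # (y_i^tr)^k as S-vectors
    # (i) v-update: ridge regression, closed form (A3)
    G = sum(H.conj().T @ P @ H for H in H_list) + lam2 * np.eye(N)
    r = sum(H.conj().T @ P @ t for H, t in zip(H_list, targets)) + lam2 * v_bar.conj()
    v = np.linalg.solve(G, r).conj()
    # (ii) b-update: minimum eigenvector of the Hermitian matrix in (A4)
    X = np.array([(H @ v.conj()).conj() for H in H_list])  # rows (H_i v^*)^H
    X = np.vstack([X, np.sqrt(lam1) * b_bar.conj()])       # augmented rows of (A5)
    Y = np.vstack([Ytr, np.sqrt(lam1) * b_bar.conj()])
    M = X.conj().T @ X - Y.conj().T @ X - X.conj().T @ Y
    eigvals, eigvecs = np.linalg.eigh(M)
    b = eigvecs[:, 0]                                      # smallest eigenvalue
    return v, b
```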

4.5. Transfer Learning for LSTD-Based Prediction

Similar to conventional learning, transfer learning for LSTD-based prediction can be addressed from the naïve extension (21) by utilizing the LSTD parametrization $V^{(K)}$ in (34) in lieu of the unconstrained predictor $V$, obtaining the bias matrix $\bar{V}^{(K),\rm trans}$ as
$$\bar{V}^{(K),\rm trans} = \arg\min_{\substack{\hat{B}, \mathbf{v}^1, \dots, \mathbf{v}^K \\ V^{(K)} = \sum_{k=1}^K \mathbf{v}^k \otimes (\hat{\mathbf{b}}^k (\hat{\mathbf{b}}^k)^H)}} \sum_{f=1}^{F} \|X_f V^{(K)} - Y_f\|_F^2, \quad \text{subject to } \hat{B}^H \hat{B} = \mathbf{I}_K, \quad (44)$$
which can also be solved via the ALS-based sequential approach detailed in Section 4.4. This produces the sequences $\bar{\mathbf{b}}^{1,\rm trans}, \dots, \bar{\mathbf{b}}^{K,\rm trans}$ and $\bar{\mathbf{v}}^{1,\rm trans}, \dots, \bar{\mathbf{v}}^{K,\rm trans}$ as (cf. (38))
$$\bar{\mathbf{b}}^{k,\rm trans}, \bar{\mathbf{v}}^{k,\rm trans} = \arg\min_{\substack{\hat{\mathbf{b}}^k, \mathbf{v}^k \\ (V^{(K)})^k = \mathbf{v}^k \otimes (\hat{\mathbf{b}}^k (\hat{\mathbf{b}}^k)^H)}} \sum_{f=1}^{F} \|X_f (V^{(K)})^k - (Y_f)^k\|_F^2, \quad \text{subject to } (\hat{\mathbf{b}}^k)^H \hat{\mathbf{b}}^k = 1, \quad (45)$$
where the residual target matrix $(Y_f)^k$ is defined as (cf. (39))
$$(Y_f)^k = \begin{cases} Y_f, & \text{for } k = 1, \\ Y_f - \sum_{k'=1}^{k-1} X_f (V^{(K)})^{k',\rm trans}, & \text{for } k > 1, \end{cases} \quad (46)$$
with the $k$-th optimized predictor
$$(V^{(K)})^{k,\rm trans} = \bar{\mathbf{v}}^{k,\rm trans} \otimes (\bar{\mathbf{b}}^{k,\rm trans} (\bar{\mathbf{b}}^{k,\rm trans})^H). \quad (47)$$
Details on transfer learning can be found in Appendix C, and the overall transfer learning scheme for LSTD prediction is summarized in Algorithm 2. After transfer learning, similar to Section 3.2, based on the optimized hyperparameters $\bar{\mathbf{b}}^{1,\rm trans}, \dots, \bar{\mathbf{b}}^{K,\rm trans}$ and $\bar{\mathbf{v}}^{1,\rm trans}, \dots, \bar{\mathbf{v}}^{K,\rm trans}$, the LSTD-based channel predictor for a new frame $f^{\rm new}$ can be obtained via (37) as
$$V_{f^{\rm new}}^{(K),*} = V^{(K),*}\big(\mathcal{Z}_{f^{\rm new}}^{\rm tr} \mid \{\bar{\mathbf{b}}^{k,\rm trans}, \bar{\mathbf{v}}^{k,\rm trans}\}_{k=1}^K\big), \quad (48)$$
which can also be solved sequentially, as in (38).
Algorithm 2: LSTD-based transfer learning for channel prediction for $S \geq 1$
[Algorithm listing available as an image in the published article.]

4.6. Meta-Learning for LSTD-Based Prediction

Plugging (37) into the naïve extension of (22), we can formulate the meta-learning problem for LSTD-based prediction as
$$\min_{\{\bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k\}_{k=1}^K} \sum_{f=1}^{F} \big\| X_f^{\rm te}\, V^{(K),*}(\mathcal{Z}_f^{\rm tr} \mid \{\bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k\}_{k=1}^K) - Y_f^{\rm te} \big\|_F^2. \quad (49)$$
Similar to the sequential approach (38) described in Section 4.4, we propose a hierarchical sequential approach for meta-learning by using (38) in the order $k = 1, \dots, K$, obtaining the problem
$$\bar{\mathbf{b}}^{k,\rm meta}, \bar{\mathbf{v}}^{k,\rm meta} = \arg\min_{\substack{\bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k \\ (V_f^{(K)})^{k,*} = \mathbf{v}_f^{k,*} \otimes (\hat{\mathbf{b}}_f^{k,*} (\hat{\mathbf{b}}_f^{k,*})^H)}} \sum_{f=1}^{F} \big\| X_f^{\rm te} (V_f^{(K)})^{k,*} - (Y_f^{\rm te})^k \big\|_F^2, \quad (50)$$
with the residual target matrix $(Y_f^{\rm te})^k$ defined as (cf. (39))
$$(Y_f^{\rm te})^k = \begin{cases} Y_f^{\rm te}, & \text{for } k = 1, \\ Y_f^{\rm te} - \sum_{k'=1}^{k-1} X_f^{\rm te} (V_f^{(K)})^{k',*}, & \text{for } k > 1. \end{cases} \quad (51)$$
The bilevel nonconvex optimization problem (50) is addressed through gradient-based updates, with the gradients computed via equilibrium propagation (EP) [60,61]. EP uses finite differences to approximate the gradient of the bilevel problem (50), where the difference is computed between two gradients obtained at two stationary points, $(\hat{\mathbf{b}}_f^{k,*}, \mathbf{v}_f^{k,*})$ and $(\hat{\mathbf{b}}_f^{k,\alpha}, \mathbf{v}_f^{k,\alpha})$: the former solves the original problem (38), while the latter solves a modified version of (38) that adds the prediction loss on the test set $\mathcal{Z}_f^{\rm te}$. Specifically, EP leverages the asymptotic equalities [60]
$$\nabla_{\bar{\mathbf{b}}^k} \sum_{f=1}^{F} \big\| X_f^{\rm te} (V_f^{(K)})^{k,*} - (Y_f^{\rm te})^k \big\|_F^2 = \lim_{\alpha \to 0} \frac{2\lambda_1}{\alpha} \sum_{f=1}^{F} \Big( \hat{\mathbf{b}}_f^{k,*} (\hat{\mathbf{b}}_f^{k,*})^H - \hat{\mathbf{b}}_f^{k,\alpha} (\hat{\mathbf{b}}_f^{k,\alpha})^H \Big) \bar{\mathbf{b}}^k \quad (52)$$
and
$$\nabla_{\bar{\mathbf{v}}^k} \sum_{f=1}^{F} \big\| X_f^{\rm te} (V_f^{(K)})^{k,*} - (Y_f^{\rm te})^k \big\|_F^2 = \lim_{\alpha \to 0} \frac{2\lambda_2}{\alpha} \sum_{f=1}^{F} (\mathbf{v}_f^{k,*} - \mathbf{v}_f^{k,\alpha}), \quad (53)$$
with an additional real-valued hyperparameter $\alpha \in \mathbb{R}$, which is generally chosen to be a small non-zero value [60,61]. In (52) and (53), the vectors $\hat{\mathbf{b}}_f^{k,\alpha}$ and $\mathbf{v}_f^{k,\alpha}$ are defined as (cf. (38))
$$\hat{\mathbf{b}}_f^{k,\alpha}, \mathbf{v}_f^{k,\alpha} = \arg\min_{\substack{\hat{\mathbf{b}}_f^k, \mathbf{v}_f^k \\ (V_f^{(K)})^k = \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H)}} \Big\{ \|X_f^{\rm tr} (V_f^{(K)})^k - (Y_f^{\rm tr})^k\|_F^2 + \alpha \|X_f^{\rm te} (V_f^{(K)})^k - (Y_f^{\rm te})^k\|_F^2 - \lambda_1 \mathrm{tr}\big((\hat{\mathbf{b}}_f^k)^H (\bar{\mathbf{b}}^k (\bar{\mathbf{b}}^k)^H) \hat{\mathbf{b}}_f^k\big) + \lambda_2 \|\mathbf{v}_f^k - \bar{\mathbf{v}}^k\|^2 \Big\}, \quad \text{subject to } (\hat{\mathbf{b}}_f^k)^H \hat{\mathbf{b}}_f^k = 1. \quad (54)$$
Derivations of the gradients (52) and (53) can be found in Appendix D.
To reduce the computational complexity of the gradient-based updates, we adopt stochastic gradient descent with the Adam optimizer, as done in [61], in order to update $\bar{\mathbf{b}}^k$ and $\bar{\mathbf{v}}^k$ based on (52) and (53). The overall LSTD-based meta-learning scheme is detailed in Algorithm 3.
After meta-learning, as in Section 3.3, based on the optimized $\bar{\mathbf{b}}^{1,\rm meta}, \dots, \bar{\mathbf{b}}^{K,\rm meta}$ and $\bar{\mathbf{v}}^{1,\rm meta}, \dots, \bar{\mathbf{v}}^{K,\rm meta}$, the LSTD-based channel predictor for a new frame $f^{\rm new}$ can be obtained via (37) as
$$V_{f^{\rm new}}^{(K),*} = V^{(K),*}\big(\mathcal{Z}_{f^{\rm new}}^{\rm tr} \mid \{\bar{\mathbf{b}}^{k,\rm meta}, \bar{\mathbf{v}}^{k,\rm meta}\}_{k=1}^K\big), \quad (55)$$
which can be solved sequentially, as in (38).
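A sketch of the EP-based meta-gradient computation (52) and (53) is given below; solve_inner is a hypothetical callback returning a stationary point of the inner problem (38) (for α = 0) or of its nudged version (54), and in practice the limit α → 0 is replaced by a small fixed α.

```python
import numpy as np

def ep_gradients(solve_inner, frames, b_bar, v_bar, lam1, lam2, alpha=0.01):
    """EP-style finite-difference estimates of the meta-gradients (52)-(53).

    solve_inner(frame, b_bar, v_bar, alpha) is assumed to return the
    stationary point (b, v) of (38) when alpha = 0, or of the nudged
    problem (54) when alpha > 0. This is a sketch, not reference code.
    """
    g_b = np.zeros_like(b_bar)
    g_v = np.zeros_like(v_bar)
    for frame in frames:
        b0, v0 = solve_inner(frame, b_bar, v_bar, alpha=0.0)    # free phase, (38)
        ba, va = solve_inner(frame, b_bar, v_bar, alpha=alpha)  # nudged phase, (54)
        g_b += (2 * lam1 / alpha) * ((np.outer(b0, b0.conj())
                                      - np.outer(ba, ba.conj())) @ b_bar)
        g_v += (2 * lam2 / alpha) * (v0 - va)
    return g_b, g_v

# The returned gradients can then drive an Adam update of (b_bar, v_bar).
```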
The computational complexity order of the considered schemes is summarized in Table 1 and Table 2. At deployment time, as seen in Table 1, all schemes have the same computational complexity as conventional learning. In contrast, in the offline meta-learning or transfer learning phase, the computational overhead depends on the dimension of the channel vector $S = N_R N_T W$. LSTD-based schemes can reduce the computational overhead as compared to the naïve solutions when the channel vector is large, i.e., $S \gg 1$, and the rank $K$ is sufficiently small. This is quantified in Table 2, where $I_{\rm ALS}$ is the number of ALS iterations and $I_{\rm EP}$ is the number of EP iterations.
Algorithm 3: LSTD-based meta-learning for channel prediction for $S \geq 1$
[Algorithm listing available as an image in the published article.]

4.7. Rank-Estimation for LSTD-Based Prediction

The number of total features $K$ for LSTD-based prediction depends on the rank of the unknown space-time signature matrix $T_f$, as discussed in Section 4.2. This rank can be estimated by using the available channels from previous frames if we assume that the number of total features does not change over multiple frames. This can be achieved via a standard method, Akaike's information criterion (AIC) (Equation (16) in [62]), which is applicable to all the proposed LSTD-based techniques. However, as AIC-based rank estimation generally tends to overestimate the rank [62,63], we propose a potentially more effective estimator for meta-learning, which utilizes a validation dataset.
To this end, we first split the available $F$ frames into $F^{\rm tr}$ meta-training frames $f = 1, \dots, F^{\rm tr}$ and $F^{\rm val}$ meta-validation frames $f = F^{\rm tr}+1, \dots, F$. Then, we compute the sum-loss (cf. (49))
$$\sum_{f=F^{\rm tr}+1}^{F^{\rm tr}+F^{\rm val}} \big\| X_f^{\rm te}\, V^{(k),*}(\mathcal{Z}_f^{\rm tr} \mid \{\bar{\mathbf{b}}^{k',\rm meta}, \bar{\mathbf{v}}^{k',\rm meta}\}_{k'=1}^{k}) - Y_f^{\rm te} \big\|_F^2, \quad (56)$$
where the hyperparameters $\{\bar{\mathbf{b}}^{k',\rm meta}, \bar{\mathbf{v}}^{k',\rm meta}\}_{k'=1}^{k}$ are computed by using the $F^{\rm tr}$ meta-training frames, as explained in the previous section. The rank-estimation procedure sequentially evaluates the meta-validation loss (56) in order to minimize it over the choice of $k$. In this regard, it is worth noting that an increase in the total number of features $k$ always decreases the meta-training loss in (49), whereas this is not necessarily true for the meta-validation and meta-test losses.
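The resulting rank-selection procedure can be sketched as follows, with meta_val_loss a hypothetical routine that runs meta-learning with k features on the meta-training frames and returns the validation loss (56).

```python
import numpy as np

def select_rank(meta_val_loss, K_max):
    """Rank selection via the meta-validation loss (56).

    meta_val_loss(k) is assumed to meta-learn with k features and return the
    validation sum-loss (56). Returns the k minimizing that loss; a sketch,
    and an early-stopping rule once the loss starts increasing would also
    fit the sequential procedure described above.
    """
    losses = [meta_val_loss(k) for k in range(1, K_max + 1)]
    return int(np.argmin(losses)) + 1
```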

5. Experiments

In this section, we present experimental results for the prediction of multi-antenna and/or frequency-selective channels. Numerical examples for single-antenna frequency-flat channels, for both offline and online learning scenarios, can be found in the conference version of this paper [45]. For all the experiments, we compute the normalized mean squared error (NMSE) $\|\hat{\mathbf{h}}_{l+\delta,f} - \mathbf{h}_{l+\delta,f}\|^2 / \|\mathbf{h}_{l+\delta,f}\|^2$, averaged over 100 samples for 200 new frames. To avoid discrepancies between the evaluation measures used during the training and testing phases, we also adopt the NMSE as the training loss function by normalizing the training dataset for the new frame $f^{\rm new}$ as (cf. (3))
$$\mathcal{Z}_{f^{\rm new}}^{\rm tr} = \{(\mathbf{x}_{i,f^{\rm new}}, \mathbf{y}_{i,f^{\rm new}})\}_{i=1}^{L^{\rm tr}} \triangleq \big\{ \big( \mathrm{vec}(\mathbf{H}_{l,f^{\rm new}}^N)/\|\mathbf{h}_{l+\delta,f^{\rm new}}\|,\; \mathbf{h}_{l+\delta,f^{\rm new}}/\|\mathbf{h}_{l+\delta,f^{\rm new}}\| \big) \big\}_{l=N}^{L^{\rm tr}+N-1}, \quad (57)$$
and similarly redefine the datasets from the previous frames $f = 1, \dots, F$ for transfer and meta-training as (cf. (9))
$$\mathcal{Z}_f = \{(\mathbf{x}_{i,f}, \mathbf{y}_{i,f})\}_{i=1}^{L} \triangleq \big\{ \big( \mathrm{vec}(\mathbf{H}_{l,f}^N)/\|\mathbf{h}_{l+\delta,f}\|,\; \mathbf{h}_{l+\delta,f}/\|\mathbf{h}_{l+\delta,f}\| \big) \big\}_{l=N}^{L+N-1}. \quad (58)$$
As summarized in Table 3, we consider a window size $N = 5$ with lag size $\delta = 3$. All of the experimental results follow the 3GPP 5G standard SCM channel model [46], with variations of the long-term features over frames following Clause 7.6.3.2 (Procedure B) of [46], under the UMi–Street Canyon environment, as discussed in Section 2.2. The normalized Doppler frequency $\rho = \gamma_{d,f}/\gamma_{\rm SRS} \in [0,1]$ within each frame $f$, defined as the ratio between the Doppler frequency $\gamma_{d,f}$ in (4) and the frequency $\gamma_{\rm SRS}$ of the pilot symbols, or sounding reference signals (SRS) [46], is randomly selected in one of the two following ways: (i) for slow-varying environments, it is uniformly drawn in the interval $[0.005, 0.05]$; and (ii) for fast-varying environments, it is uniformly drawn in the interval $[0.1, 1]$. In the following, we study the impact of (i) the number of antennas $N_R N_T$, (ii) the number of channel taps $W$, (iii) the number of training samples $L^{\rm new}$, and (iv) the number of previous frames $F$, for various prediction schemes: (a) conventional learning, (b) transfer learning, and (c) meta-learning, where each scheme is implemented by using either the naïve or the LSTD parametrization. We set $\lambda = 0$, $\lambda_1 = 0$, and $\lambda_2 = 0$ for conventional learning [9,45], and $\lambda = 1$, $\lambda_1 = 1$, and $\lambda_2 = 1$ for transfer and meta-learning.

5.1. Multi-Antenna Frequency-Flat Channels

We begin by considering multi-antenna frequency-flat channels, evaluating the NMSE as a function of the total number of antennas $N_R N_T$ under a fast-varying environment (Figure 4) or a slow-varying environment (Figure 5). We set $K = 1$ in the LSTD model. The specific antenna configurations are described in Appendix E. Both transfer and meta-learning are seen to provide significant advantages as compared to conventional learning, as long as one chooses the type of parametrization (naïve or LSTD) as a function of the type of variability in the channel, with meta-learning generally outperforming transfer learning. In particular, as seen in Figure 4, for fast-varying environments, meta-learning with the LSTD parametrization has the best performance, significantly reducing the NMSE with respect to both conventional and transfer learning. This is because meta-learning with LSTD can account for the need to adapt to fast-varying channel conditions, while also leveraging the reduced-rank structure of the channel. In contrast, as shown in Figure 5, for slow-varying channels, the naïve parametrization tends to be preferable because, as explained in Section 4.2, long-term and short-term features of the channel become indistinguishable when the channel variability is too low. It is also interesting to observe that increasing the number of antennas is generally useful for prediction, as the predictor can build on a larger vector of correlated covariates. This is, however, not the case for conventional learning in slow-varying environments, for which the features tend to be too correlated, resulting in overfitting. As a final note, although absolute NMSE values close to 1 may be insufficient for applications such as precoding, they can still provide useful information for other applications such as proactive resource allocation [40,64].

5.2. Rank Estimation

In the previous experiments, we considered channels with unit rank, for which one can assume, without loss of optimality, a number of features in the LSTD parametrization equal to $K = 1$. In order to implement predictors for multi-antenna frequency-selective channels, one instead needs to first address the problem of estimating the number of features. Here, we evaluate the performance of the approach proposed in Section 4.7 for rank estimation. To this end, we set the number of antennas as $N_R = 8$ and $N_T = 8$, and consider the 19-cluster channel model with delay spread ratio 2. Figure 6 shows the NMSE evaluated on the meta-training, meta-validation, and meta-test datasets as a function of the total number of features $K$. The meta-training set contains 20 frames, the meta-test set 200 frames, and the meta-validation set 20 frames. The meta-training loss is monotonically decreasing in $K$, because a richer parametrization enables a closer fit of the training data. In contrast, both the meta-test and meta-validation losses are optimized at an intermediate value of $K$. The main point of the figure is that the meta-validation loss, while computed on only 20 frames, provides useful information for choosing a value of $K$ that approximately minimizes the meta-test loss. Although $K = 3$ is a suitable estimate of the channel rank for the considered setup, AIC-based rank estimation gives the highly overestimated value $K = 200$, which deteriorates the prediction performance, as can be seen in Figure 6. Throughout the following experiments, we follow the proposed procedure to select $K$ for meta-learning, whereas for all the other schemes, we adopt AIC-based rank estimation to determine $K$.

5.3. Single-Antenna Frequency-Selective Channels

Before considering multi-antenna frequency-selective channels, we first study the impact of the level of frequency selectivity on the prediction of single-antenna frequency-selective channels. To this end, starting from 45 ns, we increase the delay spread by a multiplicative factor, referred to as the delay spread ratio in Figure 7, and correspondingly also increase the number of taps by the same amount. The number of taps $W$ is obtained as the smallest number of taps that captures more than 90% of the average channel power, following the ITU-R report [65]. Figure 7 shows that the dependence on the delay spread of the channel is qualitatively similar to the dependence on the number of antennas in Figure 4 and Figure 5, with the top panel of Figure 7 representing the performance under a fast-varying environment and the bottom panel depicting the NMSE for a slow-varying environment. Accordingly, as discussed in the previous subsection, meta-learning outperforms both transfer and conventional learning, as long as the parametrization is correctly selected: naïve for slow-varying channels, and LSTD for fast-varying environments.

5.4. Multi-Antenna Frequency-Selective Channel Case

We now consider the prediction performance for multi-antenna frequency-selective channels as a function of the number of training samples $L^{\rm new}$ in Figure 8 and Figure 9, as well as versus the number of frames $F$ in Figure 10. For meta-learning, we set $L^{\rm tr} = L^{\rm new}$ in order to avoid discrepancies between meta-training and meta-testing [29]. Figure 8 and Figure 9 show that meta-learning and transfer learning, which utilize $F = 500$ previous frames, can significantly outperform conventional learning in terms of the number of required pilots $L^{\rm new}$. This key observation motivates the use of transfer and meta-learning in the presence of limited training data. Furthermore, confirming the analysis in Section 3.3 and Section 4.6, meta-learning can outperform all other schemes as long as one selects the naïve parametrization for slow-varying environments and the LSTD parametrization for fast-varying environments. For sufficiently large $L^{\rm new}$, however, transfer learning can improve over meta-learning in fast-varying environments, as seen in Figure 8. This stems from the split into training and testing sets applied by meta-learning, which can lead to a performance loss as $L^{\rm new}$ increases.
Lastly, we investigate the effect of the number of previous frames $F$ on transfer and meta-learning. As a general trend, as demonstrated by Figure 10, an increase in the number $F$ of previous frames results in better performance for both transfer and meta-learning. Furthermore, in a slow-varying environment with a small value of $F$, transfer learning can outperform meta-learning due to the limited need for adaptation, whereas meta-learning, with the correctly selected type of parametrization, outperforms transfer learning otherwise.

6. Conclusions

In this paper, we have introduced data-driven channel prediction strategies for multi-antenna frequency-selective channels that aim at reducing the number of pilots by integrating transfer and meta-learning with a novel parametrization of linear predictors. The methods leverage the underlying structure of wireless channels, which can be expressed in terms of a long short-term decomposition (LSTD) into long-term space-time features and fading amplitudes. To enable transfer and meta-learning under an LSTD-based model, we have proposed an optimization strategy based on equilibrium propagation (EP) and alternating least squares (ALS). Numerical experiments have shown that the proposed LSTD-based transfer and meta-learning methods far outperform conventional prediction methods, especially in the few-pilots regime. For instance, under a standard 3GPP SCM channel model with four transmit antennas and two receive antennas, and using only one pilot, meta-learning with LSTD can reduce the normalized prediction MSE by 3 dB as compared to standard learning techniques. Future work may consider the use of deep neural networks in lieu of linear prediction filters, although related results for multi-antenna frequency-flat channels have not reported any significant advantage to date [14,15,16,19].

Author Contributions

Conceptualization, software, formal analysis, writing, S.P.; conceptualization, supervision, writing, project administration, funding acquisition, O.S. All authors have read and agreed to the published version of the manuscript.

Funding

The work of S.P. and O.S. was partly supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 725731).

Data Availability Statement

Code is available at https://github.com/kclip/channel-prediction-meta-learning (accessed on 22 September 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Derivation of the Long Short-Term Decomposition (LSTD)-Based Predictor $V_f^{(K)}$

Because the $k$-th amplitudes $\hat{d}_{l-N+1,f}^k, \dots, \hat{d}_{l+\delta,f}^k$ are all scalars, we can write $\mathrm{vec}([\hat{d}_{l,f}^k, \dots, \hat{d}_{l-N+1,f}^k]) = [\hat{d}_{l,f}^k, \dots, \hat{d}_{l-N+1,f}^k]^\top$ and $\hat{d}_{l+\delta,f}^k = (\hat{d}_{l+\delta,f}^k)^\top$. By using these equalities, we can plug (30) and (31) into (32) to obtain the expression of the predicted channel $\hat{\mathbf{h}}_{l+\delta,f}$ as
$$\hat{\mathbf{h}}_{l+\delta,f} = \sum_{k=1}^{K} \hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H \mathbf{H}_{l,f}^N ((\mathbf{v}_f^k)^H)^\top = \underbrace{\sum_{k=1}^{K} \big( (\mathbf{v}_f^k)^H \otimes (\hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H) \big)}_{(V_f^{(K)})^H \text{ from } (33)} \mathrm{vec}(\mathbf{H}_{l,f}^N), \quad (A1)$$
from which we can directly read off the LSTD-based predictor matrix $V_f^{(K)} = \sum_{k=1}^{K} \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H)$.

Appendix B. Details on Conventional Learning for LSTD-Based Prediction

Recalling from (3) that $X_f^{\rm tr} = [\mathrm{vec}(\mathbf{H}_{1,f}^N), \dots, \mathrm{vec}(\mathbf{H}_{L^{\rm tr},f}^N)]^H$, we can rewrite the first ALS step (42) in the form of a standard ridge regression problem as
$$\mathbf{v}_f^k \leftarrow \arg\min_{\mathbf{v}_f^k} \sum_{i=1}^{L^{\rm tr}} \big\| \hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H \mathbf{H}_{i,f}^N ((\mathbf{v}_f^k)^H)^\top - (\mathbf{y}_{i,f}^{\rm tr})^k \big\|^2 + \lambda_2 \|\mathbf{v}_f^k - \bar{\mathbf{v}}^k\|^2, \quad (A2)$$
with $(\mathbf{y}_{i,f}^{\rm tr})^k$ being the Hermitian transpose of the $i$-th row of the $k$-th residual target matrix $(Y_f^{\rm tr})^k$ defined in (39). This problem can be solved in closed form, similar to (14), as
$$((\mathbf{v}_f^k)^H)^\top = \left( \sum_{i=1}^{L^{\rm tr}} (\mathbf{H}_{i,f}^N)^H \hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H \mathbf{H}_{i,f}^N + \lambda_2 \mathbf{I} \right)^{-1} \left( \sum_{i=1}^{L^{\rm tr}} (\mathbf{H}_{i,f}^N)^H \hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H (\mathbf{y}_{i,f}^{\rm tr})^k + \lambda_2 ((\bar{\mathbf{v}}^k)^H)^\top \right). \quad (A3)$$
The initialization of the parameter $\hat{\mathbf{b}}_f^k$ is set to the available hyperparameter $\bar{\mathbf{b}}^k$.
Similarly, the second ALS step (43) can be rewritten as
$$\hat{\mathbf{b}}_f^k \leftarrow \arg\min_{\hat{\mathbf{b}}_f^k} \sum_{i=1}^{L^{\rm tr}} \big\| \hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H \mathbf{H}_{i,f}^N ((\mathbf{v}_f^k)^H)^\top - (\mathbf{y}_{i,f}^{\rm tr})^k \big\|^2 - \lambda_1 \mathrm{tr}\big( (\hat{\mathbf{b}}_f^k)^H (\bar{\mathbf{b}}^k (\bar{\mathbf{b}}^k)^H) \hat{\mathbf{b}}_f^k \big) = \arg\min_{\hat{\mathbf{b}}_f^k} (\hat{\mathbf{b}}_f^k)^H \big( \check{X}_{k,f}^H \check{X}_{k,f} - \check{Y}_{k,f}^H \check{X}_{k,f} - \check{X}_{k,f}^H \check{Y}_{k,f} \big) \hat{\mathbf{b}}_f^k, \quad \text{subject to } (\hat{\mathbf{b}}_f^k)^H \hat{\mathbf{b}}_f^k = 1, \quad (A4)$$
whose solution is the eigenvector of $\check{X}_{k,f}^H \check{X}_{k,f} - \check{Y}_{k,f}^H \check{X}_{k,f} - \check{X}_{k,f}^H \check{Y}_{k,f}$ corresponding to the smallest eigenvalue, with the matrices $\check{X}_{k,f}$ and $\check{Y}_{k,f}$ defined as
$$\check{X}_{k,f} = \begin{bmatrix} (\check{X}_f^{\rm tr})^k \\ \sqrt{\lambda_1}\, (\bar{\mathbf{b}}^k)^H \end{bmatrix}, \qquad \check{Y}_{k,f} = \begin{bmatrix} (Y_f^{\rm tr})^k \\ \sqrt{\lambda_1}\, (\bar{\mathbf{b}}^k)^H \end{bmatrix}, \quad (A5)$$
where we denote $(\check{X}_f^{\rm tr})^k = [\check{\mathbf{x}}_{1,f}^k, \dots, \check{\mathbf{x}}_{L^{\rm tr},f}^k]^H$ with $\check{\mathbf{x}}_{i,f}^k = \mathbf{H}_{i,f}^N ((\mathbf{v}_f^k)^H)^\top$. Note that we arbitrarily start ALS with the update of the vector $\mathbf{v}_f^k$, as the ordering did not show any meaningful impact on the final results, as also reported in [66].

Appendix C. Details on Transfer Learning for LSTD-Based Prediction

The solution of (45) can be directly obtained with the tools in Appendix B by setting $\lambda_1 = \lambda_2 = 0$ and by substituting $X_f^{\rm tr}$ and $Y_f^{\rm tr}$ with $[X_1^\top, \dots, X_F^\top]^\top$ and $[Y_1^\top, \dots, Y_F^\top]^\top$, respectively.

Appendix D. Details on Meta-Learning for LSTD-Based Prediction

Before deriving (52) and (53), for ease of notation, let us define the inner loss function $\mathcal{L}_f^{\rm inner}$, the outer loss function $\mathcal{L}_f^{\rm outer}$, and the total loss function $\mathcal{L}_f^{\rm total}$ as
$$\mathcal{L}_f^{\rm inner}(\hat{\mathbf{b}}_f^k, \mathbf{v}_f^k \mid \bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k) = \left\| X_f^{\rm tr} \frac{(V_f^{(K)})^k}{(\hat{\mathbf{b}}_f^k)^H \hat{\mathbf{b}}_f^k} - (Y_f^{\rm tr})^k \right\|_F^2 - \lambda_1 \frac{(\hat{\mathbf{b}}_f^k)^H (\bar{\mathbf{b}}^k (\bar{\mathbf{b}}^k)^H) \hat{\mathbf{b}}_f^k}{(\hat{\mathbf{b}}_f^k)^H \hat{\mathbf{b}}_f^k} + \lambda_2 \|\mathbf{v}_f^k - \bar{\mathbf{v}}^k\|^2, \quad (A6)$$
$$\mathcal{L}_f^{\rm outer}(\hat{\mathbf{b}}_f^k, \mathbf{v}_f^k) = \left\| X_f^{\rm te} \frac{(V_f^{(K)})^k}{(\hat{\mathbf{b}}_f^k)^H \hat{\mathbf{b}}_f^k} - (Y_f^{\rm te})^k \right\|_F^2, \quad (A7)$$
and
$$\mathcal{L}_f^{\rm total}(\hat{\mathbf{b}}_f^k, \mathbf{v}_f^k \mid \bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k, \alpha) = \mathcal{L}_f^{\rm inner}(\hat{\mathbf{b}}_f^k, \mathbf{v}_f^k \mid \bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k) + \alpha \mathcal{L}_f^{\rm outer}(\hat{\mathbf{b}}_f^k, \mathbf{v}_f^k), \quad (A8)$$
respectively. Because (A6) is scale-invariant with respect to $(\hat{\mathbf{b}}_f^k)^H \hat{\mathbf{b}}_f^k$ (recall that $(V_f^{(K)})^k = \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k (\hat{\mathbf{b}}_f^k)^H)$), it can be considered as an unconstrained version of (38), i.e., $(\hat{\mathbf{b}}_f^{k,*}, \mathbf{v}_f^{k,*})$ in (38) minimizes (A6). Analogously, (A8) can be considered as an unconstrained version of (54), as $(\hat{\mathbf{b}}_f^{k,\alpha}, \mathbf{v}_f^{k,\alpha})$ in (54) minimizes (A8).
Assuming that the conditions of the implicit function theorem [67] are satisfied with respect to $(\hat{\mathbf{b}}_f^{k,\alpha}, \mathbf{v}_f^{k,\alpha})$, from the finite-difference method [60], we can write the gradients for meta-learning as
$$\nabla_{\bar{\mathbf{b}}^k} \sum_{f=1}^{F} \big\| X_f^{\rm te} (V_f^{(K)})^{k,*} - (Y_f^{\rm te})^k \big\|_F^2 = \lim_{\alpha \to 0} \sum_{f=1}^{F} \frac{1}{\alpha} \left( \frac{\partial \mathcal{L}_f^{\rm total}}{\partial \bar{\mathbf{b}}^k}(\hat{\mathbf{b}}_f^{k,\alpha}, \mathbf{v}_f^{k,\alpha} \mid \bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k, \alpha) - \frac{\partial \mathcal{L}_f^{\rm total}}{\partial \bar{\mathbf{b}}^k}(\hat{\mathbf{b}}_f^{k,*}, \mathbf{v}_f^{k,*} \mid \bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k, 0) \right) = \lim_{\alpha \to 0} \frac{2\lambda_1}{\alpha} \sum_{f=1}^{F} \big( \hat{\mathbf{b}}_f^{k,*} (\hat{\mathbf{b}}_f^{k,*})^H - \hat{\mathbf{b}}_f^{k,\alpha} (\hat{\mathbf{b}}_f^{k,\alpha})^H \big) \bar{\mathbf{b}}^k \quad (A9)$$
and
$$\nabla_{\bar{\mathbf{v}}^k} \sum_{f=1}^{F} \big\| X_f^{\rm te} (V_f^{(K)})^{k,*} - (Y_f^{\rm te})^k \big\|_F^2 = \lim_{\alpha \to 0} \sum_{f=1}^{F} \frac{1}{\alpha} \left( \frac{\partial \mathcal{L}_f^{\rm total}}{\partial \bar{\mathbf{v}}^k}(\hat{\mathbf{b}}_f^{k,\alpha}, \mathbf{v}_f^{k,\alpha} \mid \bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k, \alpha) - \frac{\partial \mathcal{L}_f^{\rm total}}{\partial \bar{\mathbf{v}}^k}(\hat{\mathbf{b}}_f^{k,*}, \mathbf{v}_f^{k,*} \mid \bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k, 0) \right) = \lim_{\alpha \to 0} \frac{2\lambda_2}{\alpha} \sum_{f=1}^{F} (\mathbf{v}_f^{k,*} - \mathbf{v}_f^{k,\alpha}), \quad (A10)$$
which concludes the derivation of (52) and (53). A generalized proof, along with useful properties of EP, can be found in [60]. For initialization, the hyperparameter $\bar{\mathbf{b}}^k$ is set to the one-hot vector at position $k$, whereas the vectors $\bar{\mathbf{v}}^k$ are chosen as all-zero vectors [52].

Appendix E. Details on the Antenna Configuration in Section 5.1

The following table contains the specification of the antenna configurations used in Section 5.1. We denote by $(N_R^{\rm hor}, N_R^{\rm ver}, N_R^{\rm pol}, N_T^{\rm hor}, N_T^{\rm ver}, N_T^{\rm pol})$ the tuple comprising the number of horizontal receive antennas $N_R^{\rm hor}$, vertical receive antennas $N_R^{\rm ver}$, polarizations of the receive antennas $N_R^{\rm pol}$, horizontal transmit antennas $N_T^{\rm hor}$, vertical transmit antennas $N_T^{\rm ver}$, and polarizations of the transmit antennas $N_T^{\rm pol}$. Note that $N_R N_T = N_R^{\rm hor} N_R^{\rm ver} N_R^{\rm pol} N_T^{\rm hor} N_T^{\rm ver} N_T^{\rm pol}$.
Table A1. Antenna Configurations for Section 5.1.

Total Number of Antennas ($N_RN_T$) | Antenna Configuration $(N^{\mathrm{hor}}_R,N^{\mathrm{ver}}_R,N^{\mathrm{pol}}_R,N^{\mathrm{hor}}_T,N^{\mathrm{ver}}_T,N^{\mathrm{pol}}_T)$
1 | (1, 1, 1, 1, 1, 1)
2 | (1, 1, 1, 2, 1, 1)
4 | (1, 1, 1, 2, 2, 1)
8 | (2, 1, 1, 2, 2, 1)
16 | (2, 1, 1, 2, 2, 2)
32 | (2, 1, 1, 4, 2, 2)
64 | (2, 1, 1, 4, 4, 2)
128 | (2, 2, 1, 4, 4, 2)

References

1. Tang, F.; Kawamoto, Y.; Kato, N.; Liu, J. Future intelligent and secure vehicular network toward 6G: Machine-learning approaches. Proc. IEEE 2019, 108, 292–307.
2. Hoeher, P.; Kaiser, S.; Robertson, P. Two-dimensional pilot-symbol-aided channel estimation by Wiener filtering. In Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Munich, Germany, 21–24 April 1997.
3. Baddour, K.E.; Beaulieu, N.C. Autoregressive modeling for fading channel simulation. IEEE Trans. Wirel. Commun. 2005, 4, 1650–1662.
4. Duel-Hallen, A.; Hu, S.; Hallen, H. Long-range prediction of fading signals. IEEE Signal Process. Mag. 2000, 17, 62–75.
5. Liu, L.; Feng, H.; Yang, T.; Hu, B. MIMO-OFDM wireless channel prediction by exploiting spatial-temporal correlation. IEEE Trans. Wirel. Commun. 2013, 13, 310–319.
6. Min, C.; Chang, N.; Cha, J.; Kang, J. MIMO-OFDM downlink channel prediction for IEEE 802.16e systems using Kalman filter. In Proceedings of the 2007 IEEE Wireless Communications and Networking Conference (WCNC), Hong Kong, China, 11–15 March 2007; pp. 942–946.
7. Komninakis, C.; Fragouli, C.; Sayed, A.H.; Wesel, R.D. Multi-input multi-output fading channel tracking and equalization using Kalman estimation. IEEE Trans. Signal Process. 2002, 50, 1065–1076.
8. Kashyap, S.; Mollén, C.; Björnson, E.; Larsson, E.G. Performance analysis of (TDD) massive MIMO with Kalman channel prediction. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 3554–3558.
9. Simmons, N.; Gomes, S.B.F.; Yacoub, M.D.; Simeone, O.; Cotton, S.L.; Simmons, D.E. AI-based channel prediction in D2D links: An empirical validation. IEEE Access 2022, 10, 65459–65472.
10. Simon, D.; Shmaliy, Y.S. Unified forms for Kalman and finite impulse response filtering and smoothing. Automatica 2013, 49, 1892–1899.
11. Shmaliy, Y.S. Linear optimal FIR estimation of discrete time-invariant state-space models. IEEE Trans. Signal Process. 2010, 58, 3086–3096.
12. Pratik, K.; Amjad, R.A.; Behboodi, A.; Soriaga, J.B.; Welling, M. Neural augmentation of Kalman filter with hypernetwork for channel tracking. arXiv 2021, arXiv:2109.12561.
13. Liu, W.; Yang, L.L.; Hanzo, L. Recurrent neural network based narrowband channel prediction. In Proceedings of the 2006 IEEE 63rd Vehicular Technology Conference, Melbourne, VIC, Australia, 7–10 May 2006; Volume 5, pp. 2173–2177.
14. Jiang, W.; Schotten, H.D. A comparison of wireless channel predictors: Artificial intelligence versus Kalman filter. In Proceedings of the ICC 2019—2019 IEEE International Conference on Communications (ICC), Shanghai, China, 20–24 May 2019; pp. 1–6.
15. Jiang, W.; Strufe, M.; Schotten, H.D. Long-range MIMO channel prediction using recurrent neural networks. In Proceedings of the 2020 IEEE 17th Annual Consumer Communications & Networking Conference (CCNC), Las Vegas, NV, USA, 10–13 January 2020.
16. Kibugi, J.; Ribeiro, L.N.; Haardt, M. Machine learning prediction of time-varying Rayleigh channels. arXiv 2021, arXiv:2103.06131.
17. Yuan, J.; Ngo, H.Q.; Matthaiou, M. Machine learning-based channel prediction in massive MIMO with channel aging. IEEE Trans. Wirel. Commun. 2020, 19, 2960–2973.
18. Zhang, Y.; Alkhateeb, A.; Madadi, P.; Jeon, J.; Cho, J.; Zhang, C. Predicting future CSI feedback for highly-mobile massive MIMO systems. arXiv 2022, arXiv:2202.02492.
19. Kim, H.; Kim, S.; Lee, H.; Jang, C.; Choi, Y.; Choi, J. Massive MIMO channel prediction: Kalman filtering vs. machine learning. IEEE Trans. Commun. 2020, 69, 518–528.
20. Bogale, T.E.; Wang, X.; Le, L.B. Adaptive channel prediction, beamforming and scheduling design for 5G V2I network: Analytical and machine learning approaches. IEEE Trans. Veh. Technol. 2020, 69, 5055–5067.
21. Torrey, L.; Shavlik, J. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques; IGI Global: Hershey, PA, USA, 2010.
22. Thrun, S.; Pratt, L. Learning to Learn; Springer Science & Business Media: Berlin, Germany, 2012.
23. Raghu, A.; Raghu, M.; Bengio, S.; Vinyals, O. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. arXiv 2019, arXiv:1909.09157.
24. Jose, S.T.; Park, S.; Simeone, O. Information-theoretic analysis of epistemic uncertainty in Bayesian meta-learning. In Proceedings of the AISTATS, Virtual Event, 28–30 March 2022.
25. Yuan, Y.; Zheng, G.; Wong, K.K.; Ottersten, B.; Luo, Z.Q. Transfer learning and meta learning-based fast downlink beamforming adaptation. IEEE Trans. Wirel. Commun. 2020, 20, 1742–1755.
26. Ge, Y.; Fan, J. Beamforming optimization for intelligent reflecting surface assisted MISO: A deep transfer learning approach. IEEE Trans. Veh. Technol. 2021, 70, 3902–3907.
27. Yang, Y.; Gao, F.; Zhong, Z.; Ai, B.; Alkhateeb, A. Deep transfer learning-based downlink channel prediction for FDD massive MIMO systems. IEEE Trans. Commun. 2020, 68, 7485–7497.
28. Parera, C.; Redondi, A.E.; Cesana, M.; Liao, Q.; Malanchini, I. Transfer learning for channel quality prediction. In Proceedings of the 2019 IEEE International Symposium on Measurements & Networking (M&N), Catania, Italy, 8–10 July 2019; pp. 1–6.
29. Park, S.; Jang, H.; Simeone, O.; Kang, J. Learning to demodulate from few pilots via offline and online meta-learning. IEEE Trans. Signal Process. 2020, 69, 226–239.
30. Cohen, K.M.; Park, S.; Simeone, O.; Shamai, S. Learning to learn to demodulate with uncertainty quantification via Bayesian meta-learning. arXiv 2021, arXiv:2108.00785.
31. Mao, H.; Lu, H.; Lu, Y.; Zhu, D. RoemNet: Robust meta learning based channel estimation in OFDM systems. In Proceedings of the ICC 2019—2019 IEEE International Conference on Communications (ICC), Shanghai, China, 20–24 May 2019.
32. Raviv, T.; Park, S.; Simeone, O.; Eldar, Y.C.; Shlezinger, N. Online meta-learning for hybrid model-based deep receivers. arXiv 2022, arXiv:2203.14359.
33. Jiang, Y.; Kim, H.; Asnani, H.; Kannan, S. MIND: Model independent neural decoder. In Proceedings of the 2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Cannes, France, 2–5 July 2019.
34. Park, S.; Simeone, O.; Kang, J. Meta-learning to communicate: Fast end-to-end training for fading channels. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020.
35. Park, S.; Simeone, O.; Kang, J. End-to-end fast training of communication links without a channel model via online meta-learning. In Proceedings of the 2020 IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Atlanta, GA, USA, 26–29 May 2020.
36. Goutay, M.; Aoudia, F.A.; Hoydis, J. Deep hypernetwork-based MIMO detection. In Proceedings of the 2020 IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Atlanta, GA, USA, 26–29 May 2020.
37. Zhang, J.; Yuan, Y.; Zheng, G.; Krikidis, I.; Wong, K.K. Embedding model based fast meta learning for downlink beamforming adaptation. IEEE Trans. Wirel. Commun. 2021, 21, 149–162.
38. Karasik, R.; Simeone, O.; Jang, H.; Shamai, S. Learning to broadcast for ultra-reliable communication with differential quality of service via the conditional value at risk. arXiv 2021, arXiv:2112.02007.
39. Hu, Y.; Chen, M.; Saad, W.; Poor, H.V.; Cui, S. Distributed multi-agent meta learning for trajectory design in wireless drone networks. IEEE J. Sel. Areas Commun. 2021, 39, 3177–3192.
40. Nikoloska, I.; Simeone, O. Black-box and modular meta-learning for power control via random edge graph neural networks. arXiv 2021, arXiv:2108.13178.
41. Simeone, O.; Spagnolini, U. Lower bound on training-based channel estimation error for frequency-selective block-fading Rayleigh MIMO channels. IEEE Trans. Signal Process. 2004, 52, 3265–3277.
42. Cicerone, M.; Simeone, O.; Spagnolini, U. Channel estimation for MIMO-OFDM systems by modal analysis/filtering. IEEE Trans. Commun. 2006, 54, 2062–2074.
43. Pedersen, K.I.; Andersen, J.B.; Kermoal, J.P.; Mogensen, P. A stochastic multiple-input-multiple-output radio channel model for evaluation of space-time coding algorithms. In Proceedings of the IEEE VTS Fall VTC 2000, 52nd Vehicular Technology Conference, Boston, MA, USA, 24–28 September 2000; Volume 2, pp. 893–897.
44. Abdi, A.; Kaveh, M. A space-time correlation model for multielement antenna systems in mobile fading channels. IEEE J. Sel. Areas Commun. 2002, 20, 550–560.
45. Park, S.; Simeone, O. Predicting flat-fading channels via meta-learned closed-form linear filters and equilibrium propagation. arXiv 2021, arXiv:2110.00414.
46. 3GPP. Study on Channel Model for Frequencies from 0.5 to 100 GHz (3GPP TR 38.901 Version 16.1.0 Release 16). TR 38.901, 2020. Available online: https://www.etsi.org/deliver/etsi_tr/138900_138999/138901/16.01.00_60/tr_138901v160100p.pdf (accessed on 16 September 2022).
47. Gallager, R.G. Principles of Digital Communication; Cambridge University Press: Cambridge, UK, 2008.
48. Simeone, O. A brief introduction to machine learning for engineers. Found. Trends Signal Process. 2018, 12, 200–431.
49. Simeone, O. Machine Learning for Engineers; Cambridge University Press: Cambridge, UK, 2022.
50. Prasad, R.; Murthy, C.R.; Rao, B.D. Joint approximately sparse channel estimation and data detection in OFDM systems using sparse Bayesian learning. IEEE Trans. Signal Process. 2014, 62, 3591–3603.
51. Huang, C.; Liu, L.; Yuen, C.; Sun, S. Iterative channel estimation using LSE and sparse message passing for mmWave MIMO systems. IEEE Trans. Signal Process. 2018, 67, 245–259.
52. Denevi, G.; Ciliberto, C.; Stamos, D.; Pontil, M. Learning to learn around a common mean. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 3–8 December 2018.
53. Yin, M.; Tucker, G.; Zhou, M.; Levine, S.; Finn, C. Meta-learning without memorization. arXiv 2019, arXiv:1912.03820.
54. Swindlehurst, A.L. Time delay and spatial signature estimation using known asynchronous signals. IEEE Trans. Signal Process. 1998, 46, 449–462.
55. Nicoli, M.; Simeone, O.; Spagnolini, U. Multislot estimation of fast-varying space-time communication channels. IEEE Trans. Signal Process. 2003, 51, 1184–1195.
56. Cortes, C.; Mohri, M.; Rostamizadeh, A. Algorithms for learning kernels based on centered alignment. J. Mach. Learn. Res. 2012, 13, 795–828.
57. Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52.
58. Wong, A.S.Y.; Wong, K.W.; Leung, C.S. A unified sequential method for PCA. In Proceedings of the ICECS'99, 6th IEEE International Conference on Electronics, Circuits and Systems, Paphos, Cyprus, 5–8 September 1999; Volume 1, pp. 583–586.
59. Sidiropoulos, N.D.; De Lathauwer, L.; Fu, X.; Huang, K.; Papalexakis, E.E.; Faloutsos, C. Tensor decomposition for signal processing and machine learning. IEEE Trans. Signal Process. 2017, 65, 3551–3582.
60. Scellier, B.; Bengio, Y. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Front. Comput. Neurosci. 2017, 11, 24.
61. Zucchet, N.; Schug, S.; von Oswald, J.; Zhao, D.; Sacramento, J. A contrastive rule for meta-learning. arXiv 2021, arXiv:2104.01677.
62. Wax, M.; Kailath, T. Detection of signals by information theoretic criteria. IEEE Trans. Acoust. Speech Signal Process. 1985, 33, 387–392.
63. Liavas, A.P.; Regalia, P.A.; Delmas, J.P. Blind channel approximation: Effective channel order determination. IEEE Trans. Signal Process. 1999, 47, 3336–3344.
64. Agrawal, A.; Andrews, J.G.; Cioffi, J.M.; Meng, T. Iterative power control for imperfect successive interference cancellation. IEEE Trans. Wirel. Commun. 2005, 4, 878–884.
65. Recommendation ITU-R P.1407. Multipath Propagation and Parameterization of Its Characteristics. Available online: https://www.itu.int/dms_pubrec/itu-r/rec/p/R-REC-P.1407-0-199907-S!!PDF-E.pdf (accessed on 22 September 2022).
66. Xiao, C.; Yang, C.; Li, M. Efficient alternating least squares algorithms for low multilinear rank approximation of tensors. J. Sci. Comput. 2021, 87, 1–25.
67. Lorraine, J.; Vicol, P.; Duvenaud, D. Optimizing millions of hyperparameters by implicit differentiation. In Proceedings of the AISTATS, Virtual Event, 26–28 August 2020.
Figure 1. Illustration of the frame-based transmission system under study. At any frame $f$, based on the $N$ previous channels $h^N_{l,f}$, we investigate the problem of optimizing the $\delta$-lag prediction $\hat{h}_{l+\delta,f}$.
Figure 2. Illustration of the considered transfer and meta-learning methods. With access to pilots from previously received frames, transfer learning and meta-learning aim at obtaining the hyperparameters $\bar{V}$ to be used for channel prediction in a new frame.
Figure 3. Illustration of the considered LSTD-based prediction model. (i) Estimation of the amplitudes $d_{l,f}$ via the estimated long-term feature matrix $\hat{B}_f$. (ii) Feature-wise short-term prediction of the fading amplitudes $\hat{d}^k_{l+\delta,f}$ based on the feature-wise predictors $v^k_f$ for $k=1,\ldots,K$. (iii) Reconstruction of the predicted channel $\hat{h}_{l+\delta,f}$ from the feature matrix $\hat{B}_f$ and the predicted fading amplitudes $\hat{d}_{l+\delta,f}$.
Figure 4. Multi-antenna frequency-flat channel prediction performance as a function of the total number of antennas $N_RN_T$, under a single-clustered, single-tap ($W=1$) 3GPP SCM channel model for a fast-varying environment with number of training samples $L^{\mathrm{new}}=1$ ($K=1$).
Figure 5. Multi-antenna frequency-flat channel prediction performance as a function of the total number of antennas $N_RN_T$, under a single-clustered, single-tap ($W=1$) 3GPP SCM channel model for a slow-varying environment with number of training samples $L^{\mathrm{new}}=1$ ($K=1$).
Figure 6. Multi-antenna frequency-selective channel prediction performance as a function of the number of features $K$, under a 19-clustered, multi-tap ($W=4$), multi-antenna ($N_T=8$, $N_R=8$) 3GPP SCM channel model for a fast-varying environment with number of training samples $L^{\mathrm{new}}=1$. Results are evaluated with $F^{\mathrm{tr}}=20$ previous frames for meta-training, $F^{\mathrm{val}}=20$ for meta-validation, and $F^{\mathrm{te}}=200$ for meta-testing.
Figure 7. Single-antenna frequency-selective channel prediction performance as a function of the delay spread ratio, under a 19-clustered, multi-tap, single-antenna ($N_T=1$, $N_R=1$) 3GPP SCM channel model for a fast-varying environment (top) and a slow-varying environment (bottom) with number of training samples $L^{\mathrm{new}}=1$ ($K=1$).
Figure 8. Multi-antenna frequency-selective channel prediction performance as a function of the number of training samples $L^{\mathrm{new}}$, under a 19-clustered, two-tap ($W=2$), multi-antenna ($N_T=4$, $N_R=2$) 3GPP SCM channel model for a fast-varying environment, with the total number of features set to $K=2$ unless determined as described in Section 5.2.
Figure 9. Multi-antenna frequency-selective channel prediction performance as a function of the number of training samples $L^{\mathrm{new}}$, under a 19-clustered, two-tap ($W=2$), multi-antenna ($N_T=4$, $N_R=2$) 3GPP SCM channel model for a slow-varying environment.
Figure 10. Multi-antenna frequency-selective channel prediction performance as a function of the number of available previous frames $F$, under a 19-clustered, two-tap ($W=2$), multi-antenna ($N_T=4$, $N_R=2$) 3GPP SCM channel model for a fast-varying environment (top) and a slow-varying environment (bottom) with number of training samples $L^{\mathrm{new}}=1$.
Table 1. Computational complexity analysis at deployment (meta-testing).

Learning Type | $O(\cdot)$ for Naïve Approach | $O(\cdot)$ for LSTD-Based Approach
Conventional learning | $O(S^3N^3+S^2N^2L^{\mathrm{tr}})$ | $O(KI_{\mathrm{ALS}}(L^{\mathrm{tr}}(SN^2+S^2)+N^3+S^3))$
Transfer learning | $O(S^3N^3+S^2N^2L^{\mathrm{tr}})$ | $O(KI_{\mathrm{ALS}}(L^{\mathrm{tr}}(SN^2+S^2)+N^3+S^3))$
Meta-learning | $O(S^3N^3+S^2N^2L^{\mathrm{tr}})$ | $O(KI_{\mathrm{ALS}}(L^{\mathrm{tr}}(SN^2+S^2)+N^3+S^3))$
Table 2. Computational complexity analysis during meta-training.

Learning Type | $O(\cdot)$ for Naïve Approach | $O(\cdot)$ for LSTD-Based Approach
Conventional learning | N/A | N/A
Transfer learning | $O(S^3N^3+S^2N^2FL)$ | $O(KI_{\mathrm{ALS}}(FL(SN^2+S^2)+N^3+S^3))$
Meta-learning | $O(F(S^3N^3+S^2N^2L+S^2NL^{\mathrm{te}}(L^{\mathrm{tr}}+SN)))$ | $O(KFI_{\mathrm{EP}}I_{\mathrm{ALS}}(L(SN^2+S^2)+N^3+S^3))$
Table 3. Experimental setting.

Window size ($N$) | 5
Lag size ($\delta$) | 3
Number of previous frames ($F$) | 500
Number of slots ($L+N-\delta+1$) | 107
Frequency of the pilot signals ($\omega_{\mathrm{SRS}}/2\pi$) | 200
Normalized Doppler frequency for slow-varying environment | $\rho\sim\mathrm{Unif}[0.005,\,0.05]$
Normalized Doppler frequency for fast-varying environment | $\rho\sim\mathrm{Unif}[0.1,\,1]$
SNR for channel estimation | 20 dB
Number of pilots for channel estimation | 100