# Multivariate Time Series Information Bottleneck

## Abstract

## 1. Introduction

## 2. Methods

#### 2.1. IB-Based Optimal Compression for Time Series Forecasts

#### 2.2. Compression by Source Masking

#### 2.3. Compressing Multi-Dimensional Data by Extreme Spatiotemporal Dimension Reduction

#### 2.4. Performing the Forecast

#### 2.4.1. Decoder

#### 2.4.2. Partial IB Loss with U-Net

#### 2.4.3. IB Interpretation with the Partial Loss

- If we remove the skipping layers, the bottleneck should retain not only the statistics of the transitions from the prior to the posterior but also the reconstruction statistics of the prior.
- If we also remove the source masking of the posterior in ${\mathbf{X}}_{1:T+F}$, the model reduces to an autoencoder (AE), and the bottleneck is supposed to perform a dimension reduction of the MTS. Because of the curse of dimensionality, this technique is commonly used so that subsequent classification is performed on the bottleneck representation, which often works better than classification on the raw high-dimensional MTS data.

#### 2.5. Proposed Model

**Example of architecture when $128<\max(M,T+F)\le 256$:** The maximum spatiotemporal size is $\max(M,T+F)$, which can also be interpreted as the maximum width or height of the pseudo-images. In this situation, input pseudo-images are zero-padded to a square shape of $256\times 256$, and the investigated model is sketched in Figure 2. It uses the classical image extension architecture, U-Net [31], which has a symmetrical structure made of an encoder and a decoder, both with 8 layers. Implementation details of the architecture are given in Appendix B, Table A2. The encoded representation ${\mathbf{Z}}_{ib\_tr}$ has a size of $1\times 1\times 512$. Each encoding layer uses a stride of 2 and therefore halves the width and height, while the number of channels increases up to 512. The decoder has a symmetrical structure, but the input of each layer is the concatenation of an upsampled version of the previous layer's output with the output of the symmetrical encoding layer. The U-Net is combined with partial convolutions (PCs), which were proposed in [4] to handle masked data. These convolutions are applied at each hidden step and are designed to ignore the missing data, such as the masked posterior ${\mathbf{X}}_{T+1:T+F}$ within ${\mathbf{X}}_{1:T+F}$ at the input layer. At each step of the encoding, the proportion of the masked part is reduced. Each PC is followed by a batch normalization and a ReLU activation, except for the last output layer, whose activation is a sigmoid. For the training, we used input images of size $240\times 240$, which were centered and zero-padded to $256\times 256$ to fit the U-Net input size. Because each encoding step has a stride of 2 and the latent representation has a size of $1\times 1$, a U-Net with 8 encoding layers requires an input size of ${2}^{8}=256$.
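The size arithmetic in this paragraph can be checked with a short sketch. It is only an illustration under stated assumptions: the helper names `encoder_sizes` and `center_pad` are hypothetical, pseudo-images are represented as plain nested lists, and no convolution is actually computed.

```python
def encoder_sizes(input_size, n_layers, stride=2):
    """Spatial size after each encoding layer when every layer divides it by `stride`."""
    sizes = []
    size = input_size
    for _ in range(n_layers):
        if size % stride != 0:
            raise ValueError("input size must be divisible by the stride at every layer")
        size //= stride
        sizes.append(size)
    return sizes


def center_pad(image, target):
    """Zero-pad a 2D list of lists to target x target, keeping the image centered."""
    rows, cols = len(image), len(image[0])
    top = (target - rows) // 2
    left = (target - cols) // 2
    padded = [[0.0] * target for _ in range(target)]
    for r, row in enumerate(image):
        padded[top + r][left:left + cols] = row
    return padded


# A 240 x 240 pseudo-image is padded to 256 x 256 (8 zero rows/columns on each side),
# and 8 stride-2 layers then bring 256 down to the 1 x 1 bottleneck.
pseudo_image = [[1.0] * 240 for _ in range(240)]
padded = center_pad(pseudo_image, 256)
print(len(padded), len(padded[0]))
print(encoder_sizes(256, 8))
```

The same helper shows why $240$ cannot be used directly: $240$ stops being divisible by 2 after four halvings, so 8 stride-2 layers cannot reach a $1\times 1$ representation without padding.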

#### 2.6. IRIS Dataset

#### 2.6.1. Problem Formulation

**The IRIS restrictions of online observations:** Figure 3 explains the IRIS observations in the atmosphere of the Sun with images of given wavelengths and spectra of given positions. Despite the very high precision of IRIS and its capacity to observe a very wide range of astrophysical parameters in time and space, significant difficulties inherent to online observations remain. Spectral observations are limited in time and space, as they only correspond to the position of the slit at a given time, which may vary, and the satellite has to store the data before sending them to Earth-based stations [70]. IRIS observations are, therefore, very sparse over all of the potentially observable parameters, and they may lack much of the data from other spatial positions. We may also be interested in further observations after the termination of the acquisition/recording session, which is limited by the IRIS storage memory capacity.

**Non-homogeneous cadences of the data:** Time series modeling is usually performed by RNNs [73] or LSTMs [6], as briefly summarized in Figure 1. These models are designed for time series with a fixed cadence, whereas Figure 4 shows the wide variety of cadences in our data, which makes the use of an RNN or LSTM difficult. To represent the time sequences ${\mathbf{X}}_{1:T+F}$ of the data under a common cadence, one would have to resample them at a cadence equal to the greatest common divisor of all of the cadences, which would make those time sequences ${\mathbf{X}}_{1:T+F}$ highly sparse and penalize the learning of transitions between time steps.
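The sparsity argument can be made concrete with a small sketch. The cadence values below are hypothetical and are assumed to be expressed as integers in a common time unit; they are not taken from the IRIS data, and the helper names are illustrative only.

```python
from functools import reduce
from math import gcd


def common_cadence(cadences):
    """Greatest common divisor of integer cadences (all in the same time unit)."""
    return reduce(gcd, cadences)


def fill_fraction(cadence, common):
    """Fraction of slots on the common grid that actually hold an observation."""
    return common / cadence


# Hypothetical cadences, e.g. in seconds.
cadences = [3, 5, 16]
common = common_cadence(cadences)
for cadence in cadences:
    print(cadence, fill_fraction(cadence, common))
```

With these illustrative values the common cadence is 1, so a sequence observed every 16 time units would fill only 1 slot in 16 on the common grid, and the remaining slots would have to be treated as missing.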

**Clustering spectral data:** The 53 clusters of MgIIh/k lines found in [55] allow interpreting the physics on the surface of the Sun. We can compare the original and predicted time sequences through their clustered time sequences in order to demonstrate the utility of our forecasting model in solar physics, namely that it conserves the types of activities.

**Astrophysical features:** In [72], the authors defined ten solar spectra features to be used as dimensional reductions of spectral data for activity classification purposes. We studied the conservation of these features in the forecasted sequences to show the applicability to astrophysics.

#### 2.6.2. Proposed Approach

#### 2.7. Other MTS Dataset

**AL** dataset: The solar power dataset for the year 2006 in Alabama is publicly available (www.nrel.gov/grid/solar-power-data.html, accessed on 20 February 2023). It contains solar power data for 137 solar photovoltaic power plants. Power was sampled every 5 min in the year 2006. Preprocessing was conducted to extract only daytime events by ignoring the nights, when the data were zero. At each 5-min interval, the data consist of vectors with 137 dimensions, and these vectors are normalized by their maximum coordinates. As with the IRIS data, the maximum value at each time step is therefore always 1.

**PB** dataset: PeMS-BAY data [74] are publicly available (https://zenodo.org/record/5146275#.Y5hF7nbMI2w, accessed on 20 February 2023) and were selected from 325 sensors in the Bay Area of San Francisco by the California State Transportation Agency’s Performance Measurement System [75]. The data represent 6 months of traffic speeds ranging from January 1 to May 31 2017. At each 5-min interval, the data consist of vectors with 325 dimensions, and these vectors are normalized by their maximum coordinates. As with the IRIS data, the maximum value at each time step is therefore always 1.
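The per-time-step normalization described for both datasets can be sketched as follows. This is a minimal illustration: the function name `normalize_by_max` is hypothetical, the data are assumed to be non-negative, and leaving all-zero vectors unchanged is an assumption made here to avoid a division by zero, not a detail stated in the text.

```python
def normalize_by_max(series):
    """Divide each time-step vector by its maximum coordinate.

    series: list of time steps, each a list of M non-negative values.
    All-zero vectors are returned unchanged (assumption, see lead-in).
    """
    normalized = []
    for vector in series:
        peak = max(vector)
        if peak == 0:
            normalized.append(list(vector))
        else:
            normalized.append([value / peak for value in vector])
    return normalized


# Three time steps of a toy series with M = 2 spatial dimensions.
toy_series = [[2.0, 4.0], [0.0, 0.0], [5.0, 1.0]]
print(normalize_by_max(toy_series))
```

After this step, every non-zero time step has a maximum of exactly 1, which matches the bounded range expected by a sigmoid output layer.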

#### 2.8. Complementary Classifiers to Show Consistency with Applied Sciences

#### 2.9. Comparison with Other Models

- Unique joint spatiotemporal IB: The encoder jointly compresses spatial and temporal dimensions of the prior into a bottleneck with an extreme spatiotemporal dimensional reduction; this is our proposed IB-MTS formulation.
- MTS decomposition model, such as NBeats [3].

**LSTM** model: An LSTM cell [6] performs the one-step-ahead forecast and is trained to predict ${\mathbf{X}}_{t+1}$ from ${\mathbf{X}}_{t}$. It incorporates one layer with $M$ LSTM units. For instance, for the $240\times 240$ spatiotemporal dimensions of IRIS data, the first 180 time steps are the prior data, and the last 60 time steps are the posterior data to forecast. This model is designed with 240 spatial LSTM units looped 180 times, and all of the cell outputs are returned by the model using the TensorFlow option $return\_sequences=True$. This layer returns a $180\times 240$ output, and only the last 60 time steps are kept. Moreover, a source masking of the posterior is applied to the input, and an identity skipping layer is added to transmit the prior ${\mathbf{X}}_{1:T}$ to the output at the same temporal positions in ${\mathbf{X}}_{1:T+F}$, such that the LSTM layer only accounts for predicting the posterior part ${\mathbf{X}}_{T+1:T+F}$. The number of units is directly determined by the shape of the input and output data. Details of the architecture are given in Appendix B, Table A3.

**GRU** model: A GRU cell [15] is trained to predict ${\mathbf{X}}_{t+F}$ from ${\mathbf{X}}_{t}$. The structure and number of units are the same as in the LSTM model, but GRU cells are used instead of LSTM ones. Details of the architecture are given in Appendix B, Table A3.

**ED-LSTM** model: A version using LSTM cells with an encoder and a decoder was implemented as described in Figure 5. Because of the encoder and decoder structures, we name it ED-LSTM. This model conducts multiple-step-ahead forecasts and can forecast ${\mathbf{X}}_{T+1:T+F}$ from ${\mathbf{X}}_{1:T}$. The model incorporates LSTM cells organized into four layers: two layers of encoding into a bottleneck and two layers of decoding from the bottleneck. The first encoding layer is composed of 100 spatial units looped 180 times on the prior IRIS data, and all of the cell outputs are returned using the TensorFlow option $return\_sequences=True$, giving a $180\times 100$ spatiotemporal output that accounts for a spatial compression. The second encoding layer is composed of 100 spatial units looped 180 times, and only the last cell outputs of the recurrences are returned, giving a 100-dimensional bottleneck that accounts for a spatial compression followed by a temporal compression. This bottleneck representation is repeated 60 times for IRIS data in order to model the decoding of the 60 posterior time steps to forecast. After this repetition, the data are of size $60\times 100$ and are fed to the first decoding layer, which has 100 spatial units looped 60 times and is initialized with the states obtained from the second encoding layer. Indeed, the structure is symmetrical, such that the first and second decoding layers are, respectively, the images of the second and the first encoding layers. All cell outputs are returned using the TensorFlow option $return\_sequences=True$, such that a $60\times 100$ spatiotemporal output is returned. The second layer of the decoder is designed with 100 spatial units looped 60 times and is initialized with the states obtained from the first encoding layer, such that a $60\times 100$ output is returned. In the end, a time-distributed dense layer is used to map the $60\times 100$ output data into the $60\times 240$ MTS data format. For these models, the input and output shapes are determined by the data, and one can only change the numbers of spatial cells ${n}_{1}$ and ${n}_{2}$ used, respectively, in the first and second layers of the encoding part. ${n}_{2}$ determines the dimension of the bottleneck and, on the IRIS data, $180>{n}_{1}\ge {n}_{2}\ge 1$. Our experiments show that the results of these models do not depend much on the values of ${n}_{1}$ and ${n}_{2}$, but they drop significantly when ${n}_{2}$ is very small, close to 1. Details of the architecture are given in Appendix B, Table A4.

**ED-GRU** model: This model has the same structure and number of units as the ED-LSTM model, but with GRU cells instead of LSTM cells. Details of the architecture are given in Appendix B, Table A4.

**NBeats** model: We use the code given in the original paper [3]. This model can forecast ${\mathbf{X}}_{T+1:T+F}$ from ${\mathbf{X}}_{1:T}$. The model is used in its generic architecture as described in [3], with 2 blocks per stack, theta dimensions of $(4,4)$, shared weights in stacks, and 100 hidden-layer units. For IRIS data, the prior is $180\times 240$, the forecast posterior is $60\times 240$, and the backcast posterior is $180\times 240$, but we also use a skipping layer to connect the input to the backcast posterior, such that the model is forced to learn the transition between the prior and the forecast posterior. We attempted NBeats with other settings and other numbers of stacks without any gain in performance, and when the number of stacks became greater than 4, the model failed to initialize on our machines.

## 3. Results

- **MTS metrics:** MAE, MAPE, and RMSE evaluation. These metrics are defined at each time step as the means of $|{\mathbf{X}}_{t}-{\widehat{\mathbf{X}}}_{t}|$ for MAE and of $|{\mathbf{X}}_{t}-{\widehat{\mathbf{X}}}_{t}|/|{\mathbf{X}}_{t}|$ for MAPE, and as the square root of the mean of ${\left({\mathbf{X}}_{t}-{\widehat{\mathbf{X}}}_{t}\right)}^{2}$ for RMSE.
- **CV metrics:** PSNR and SSIM evaluation. The ${\mathrm{PSNR}}_{t}$ is defined at each time step $t$ as $-10\log_{10}\left({\mathrm{MSE}}_{t}\right)$, with ${\mathrm{MSE}}_{t}$ being the mean of ${\left({\mathbf{X}}_{t}-{\widehat{\mathbf{X}}}_{t}\right)}^{2}$. The larger the PSNR, the better the prediction. The SSIM is defined at each time step by [84]:

  $$\mathrm{SSIM}_{t}=\frac{\left(2{\mu}_{t}{\widehat{\mu}}_{t}+{(0.01L)}^{2}\right)\left(2{\sigma}_{t}{\widehat{\sigma}}_{t}+{(0.03L)}^{2}\right)\left(\mathrm{cov}_{t}+{(0.0212L)}^{2}\right)}{\left({\mu}_{t}^{2}+{\widehat{\mu}}_{t}^{2}+{(0.01L)}^{2}\right)\left({\sigma}_{t}^{2}+{\widehat{\sigma}}_{t}^{2}+{(0.03L)}^{2}\right)\left({\sigma}_{t}{\widehat{\sigma}}_{t}+{(0.0212L)}^{2}\right)},$$

  where ${\mu}_{t}$ and ${\widehat{\mu}}_{t}$ are the means and ${\sigma}_{t}$ and ${\widehat{\sigma}}_{t}$ the standard deviations of ${\mathbf{X}}_{t}$ and ${\widehat{\mathbf{X}}}_{t}$, $\mathrm{cov}_{t}$ is their covariance, and $L$ is the dynamic range of the values.
- **Astrophysical features** evaluation: Twelve features defined in [72] are evaluated for IRIS data. For these data, each time step corresponds to an observed spectral line in a particular region of the Sun. The intensity, triplet intensity, line center, line width, line asymmetry, total continuum, triplet emission, k/h ratio integrated, k/h ratio max, k-height, peak ratio, and peak separation are the twelve measures on these spectral lines. These features provide insight into the nature of the physics occurring in the observed region of the Sun. These metrics are evaluated at each time step to show that the IB principle and a powerful CV metric are sufficient to provide reliable predictions in terms of physics.
- **The IB evaluation** is performed on centroid distributions in the prior ${\mathbf{X}}_{1:T}$, genuine ${\mathbf{X}}_{T+1:T+F}$, and predicted forecasts ${\widehat{\mathbf{X}}}_{T+1:T+F}$. A k-means was performed in [55] for the spectral lines ${\mathbf{X}}_{t}$ that are to be predicted over time. The corresponding centroids $C$ are used in this work to evaluate information theory measurements on the quantized data. Entropies for the prior $H\left({c}_{0}\right)$, genuine $H\left({c}_{1}\right)$, and predicted $H\left({c}_{2}\right)$ distributions were averaged on the test data, and the distributions of the prediction and the genuine were compared by computing the mutual information $I({c}_{1};{c}_{2})$.
- The **classification accuracy** between the genuine and the forecast classifications was also evaluated. In the context of the IRIS data, three classes of solar activity are considered: QS, AR, and FL. Classifications are compared between the genuine target ${\mathbf{X}}_{T+1:T+F}$ and the predicted forecast ${\widehat{\mathbf{X}}}_{T+1:T+F}$ to assess whether the forecast activity complies with the targeted activity. $TSS$ [76] and $HSS$ [77] are evaluated globally and for each prediction class. These scores are defined in Section 2.8.
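The per-time-step MTS metrics, the PSNR, and the information measures on centroid assignments can be sketched in plain Python as follows. This is an illustrative sketch, not the evaluation code used for the reported results. It assumes data normalized to a peak value of 1 (so that the PSNR reduces to $-10\log_{10}(\mathrm{MSE}_t)$), nonzero genuine values for MAPE, entropies measured in bits, and empirical frequencies as probability estimates. The function names are hypothetical.

```python
import math
from collections import Counter


def step_metrics(x, x_hat):
    """MAE, MAPE, RMSE and PSNR for one time step.

    x, x_hat: equal-length sequences of the M genuine and predicted values.
    Assumes values normalized to a peak of 1 and nonzero x (needed by MAPE).
    """
    n = len(x)
    abs_err = [abs(a - b) for a, b in zip(x, x_hat)]
    mse = sum(e * e for e in abs_err) / n
    return {
        "mae": sum(abs_err) / n,
        "mape": sum(e / abs(a) for e, a in zip(abs_err, x)) / n,
        "rmse": math.sqrt(mse),
        # PSNR_t = -10 log10(MSE_t); infinite for a perfect prediction.
        "psnr": -10.0 * math.log10(mse) if mse > 0 else math.inf,
    }


def entropy(labels):
    """Empirical Shannon entropy, in bits, of a sequence of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(c1, c2):
    """I(c1; c2) = H(c1) + H(c2) - H(c1, c2) for paired centroid assignments."""
    return entropy(c1) + entropy(c2) - entropy(list(zip(c1, c2)))


# Toy example: one time step with M = 2, and four paired centroid assignments.
print(step_metrics([0.5, 1.0], [0.4, 1.0]))
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
```

When the predicted centroid assignments equal the genuine ones, $I({c}_{1};{c}_{2})$ equals $H({c}_{1})$, which is the upper bound that a forecast can reach on this measure.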

#### 3.1. Evaluations of Predictions on IRIS Data

#### Longer Predictions

#### 3.2. MTS Metrics Evaluation

#### 3.3. Computer Vision Metrics Evaluation

#### 3.3.1. Information Bottleneck Evaluation on IRIS Data

#### 3.3.2. Astrophysical Evaluations

#### 3.3.3. Solar Activity Classification

## 4. Discussion

#### 4.1. Conclusions

#### 4.2. Spatial Sorting of MTS Data

#### 4.3. Non-Homogeneous Cadences of IRIS Data

#### 4.4. Pros

#### 4.5. Cons and Possible Extensions

#### 4.6. Future Research Directions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning |
---|---|
AL | solar power dataset for the year 2006 in Alabama |
AR | solar active region |
ARIMA | autoregressive integrated moving average model |
CV | computer vision |
DNN | deep neural network |
FL | solar flare |
GNN | graph neural network |
GRU | gated recurrent unit |
HSS | Heidke skill score |
IB | information bottleneck |
IRIS | NASA’s interface region imaging spectrograph satellite |
IT | information theory |
LSTM | long short-term memory model |
MAE | mean absolute error |
MAPE | mean absolute percentage error |
ML | machine learning |
MOS | mean opinion score |
MSE | mean square error |
MTS | multiple time series |
NLP | natural language processing |
NN | neural network |
PB | PeMS-BAY dataset |
PC | partial convolution |
PSNR | peak signal-to-noise ratio |
QS | quiet Sun |
RAM | random access memory |
RGB | red–green–blue |
RMSE | root mean square error |
RNN | recurrent neural network |
SSIM | structural similarity |
TS | time series |
TSS | true skill statistic |

## Appendix A. Theory

Random Variables | |
---|---|
$\mathbf{X}$ | generic spatiotemporal data |
$\tilde{\mathbf{X}}$ | estimation of $\mathbf{X}$ |
$T$ | scalar duration of the prior sequence |
$F$ | scalar duration of the posterior sequence |
$M$ | spatial size of spatiotemporal data |
${\mathbf{X}}_{t}$ | multidimensional data at time step $t$ |
${X}_{t}^{m}$ | scalar value at time step $t$ and spatial index $m$ |
${\mathbf{X}}_{1:T}={\mathbf{X}}_{1:T}^{1:M}$ | prior sequence |
${\tilde{\mathbf{X}}}_{1:T}$ | prior sequence estimation |
${\mathbf{X}}_{T+1:T+F}={\mathbf{X}}_{T+1:T+F}^{1:M}$ | posterior genuine sequence |
${\tilde{\mathbf{X}}}_{T+1:T+F}$ | posterior genuine sequence estimation |
${\mathbf{X}}_{1:T+F}={\mathbf{X}}_{1:T+F}^{1:M}$ | full sequence |
${\tilde{\mathbf{X}}}_{1:T+F}$ | full sequence estimation |
$\mathbf{Z}$ | bottleneck |
$1:T\to T+1:T+F$ | transition from prior to posterior |
${\mathbf{Z}}_{ib\_tr}$ | IB bottleneck for transition $1:T\to T+1:T+F$ learning |
${\mathbf{Z}}_{ib\_ae}$ | IB bottleneck of AE |
$\mathbf{M}$ | generic mask for spatiotemporal data |
${\mathbf{M}}_{1:T}$ | prior mask |
${\mathbf{M}}_{T+1:T+F}$ | posterior mask |
${\mathbf{M}}_{1:T+F}$ | full mask |
${\mathbf{M}}_{ib\_tr}$ | mask at the IB bottleneck for transition $1:T\to T+1:T+F$ learning |
$\mathbf{1}$ & ${\mathbf{1}}_{len}$ & ${\mathbf{1}}_{row\times col}$ | vectors and matrices of ones, optionally with specified length $len$ or $row$ and $col$ sizes |
$\mathbf{0}$ & ${\mathbf{0}}_{len}$ & ${\mathbf{0}}_{row\times col}$ | vectors and matrices of zeros, optionally with specified length $len$ or $row$ and $col$ sizes |
$K$ & $\mathbf{K}$ | scalar & categorical labels |
${c}_{0}$, ${c}_{1}$ & ${c}_{2}$ | prior, genuine, and predicted centroid assignments |
**Information Theory** | |
${p}_{D}$ | data distribution |
${p}_{\Theta}$ & ${p}_{\Phi}$ | (encoding/decoding) distribution with parameter ($\Theta$/$\Phi$) |
${\mathbb{E}}_{{p}_{D}}\left[\cdot\right]$ | mean by sampling from ${p}_{D}$ |
${\mathbb{E}}_{{p}_{\Theta}}\left[\cdot\right]$ | mean by sampling from ${p}_{\Theta}(\mathbf{Z}\mid\mathbf{X})$ |
$H(\cdot)$ & $H(\cdot,\cdot)$ | generic entropy and cross-entropy |
${H}_{{p}_{\Theta}}$ | entropy parametrized by the encoder |
${H}_{{p}_{\Theta},{p}_{\Phi}}$ | cross-entropy parametrized by the encoder and the decoder |
$KL(\cdot\parallel\cdot)$ & $I(\cdot;\cdot)$ | $KL$-divergence & mutual information |
${I}_{\Theta}$ & ${I}_{\Phi}$ | encoding and decoding mutual information |
**Layers & mappers** | |
$Id$ | identity mapper |
$Concat$ | concatenation of tensors |
$(-/B/P)Conv$ | (-/Binary/Partial) convolutional layer |
$(-/B/P)DConv$ | (-/Binary/Partial) deconvolutional layer |
**Losses & metrics** | |
$\mathcal{L}$ | generic loss |
$\tilde{\mathcal{L}}={\mathcal{L}}_{1}+{\mathcal{L}}_{2}+{\mathcal{L}}_{3}$ | upper bound on the loss $\mathcal{L}$ |
${\mathcal{L}}_{3}^{Lap}$ | ${\mathcal{L}}_{3}$ with Laplacian assumption of ${p}_{\Phi}({\mathbf{X}}_{T+1:T+F}\mid{\mathbf{Z}}_{ib\_tr})$ |
${\mathcal{L}}_{3}^{Lap,UNet}$ | ${\mathcal{L}}_{3}^{Lap}$ for a U-Net architecture |
$re{f}_{k,t}$ | relative error for feature $k$ at time step $t$ |

**Proof of Equation (5).**

**Remark A1 (Details on** $PConv$ **and** $PDeconv$ **for MTS).**

**Proof of Equation (A8).**

**Proof of Equation (A10).**

**Proof of Equation (A13).**

## Appendix B. Models

**Table A2.** Model summary of IB-MTS for IRIS data. The encoding part has 8 successive repetitions, indexed by $i\in \left[0:7\right]$, of PConv and BatchNormalization layers followed by ReLu activations, except that no BatchNormalization layer is included when $i=0$. The decoding part has 8 successive repetitions, indexed by $i\in \left[0:7\right]$, of UpSampling, Concatenation, PConv, and BatchNormalization layers followed by ReLu activations, except that no BatchNormalization layer is included when $i=7$. EncPConv2D${}_{0}$ is connected to [zero_pad2d${}_{1}$, zero_pad2d${}_{2}$]. DecUpImg${}_{0}$ is connected to [EncReLu${}_{7}$]. DecUpMsk${}_{0}$ is connected to [EncPConv2D${}_{7}$[1]]. DecConcatImg${}_{7}$ is connected to [zero_pad2d${}_{1}$[1], DecUpImg${}_{7}$]. DecConcatMsk${}_{7}$ is connected to [zero_pad2d${}_{2}$[1], DecUpMsk${}_{7}$].

Model: IB-MTS for IRIS Data | | | | |
---|---|---|---|---|
Layer (Type) | Output Shape | Kernel | Param # | Connected to |
inputs_img (InputLayer) | [(240, 240, 1)] | – | 0 | [] |
inputs_mask (InputLayer) | [(240, 240, 1)] | – | 0 | [] |
zero_pad2d${}_{1}$ (ZeroPad2D) | (256, 256, 1) | – | 0 | [inputs_img] |
zero_pad2d${}_{2}$ (ZeroPad2D) | (256, 256, 1) | – | 0 | [inputs_mask] |
EncPConv2D${}_{i}$ (PConv2D) | $i=0$: (128, 128, 64); $i=1$: (64, 64, 128); $i=2$: (32, 32, 256); $i=3$: (16, 16, 512); $i=4$: (8, 8, 512); $i=5$: (4, 4, 512); $i=6$: (2, 2, 512); $i=7$: (1, 1, 512) | 7, 5, 5, 3, 3, 3, 3, 3 (for $i=0,\dots ,7$) | 18880, 410240, 1639168, 2361856, 4721152, 4721152, 4721152, 4721152 (for $i=0,\dots ,7$) | [EncReLu${}_{i-1}$, EncPConv2D${}_{i-1}$[1]] |
EncBN${}_{i\ne 0}$ (BatchNorm) | | | | [EncPConv2D${}_{i}$[0]] |
EncReLu${}_{i}$ (Activation) | | | | [EncBN${}_{i}$] |
DecUpImg${}_{i}$ (UpSampling2D) | $i=0$: (2, 2, 512); $i=1$: (4, 4, 512); $i=2$: (16, 16, 512); $i=3$: (32, 32, 512); $i=4$: (64, 64, 256); $i=5$: (128, 128, 128); $i=6$: (256, 256, 64); $i=7$: (256, 256, 3) | 3, 3, 3, 3, 3, 3, 3, 3 (for $i=0,\dots ,7$) | 9439744, 9439744, 9439744, 9439744, 3540224, 885376, 221504, 3621 (for $i=0,\dots ,7$) | [DecLReLU${}_{i-1}$] |
DecUpMsk${}_{i}$ (UpSampling2D) | | | | [DecPConv2D${}_{i-1}$[1]] |
DecConcatImg${}_{i}$ (Concatenate) | | | | [EncReLu${}_{6-i}$, DecUpImg${}_{i}$] |
DecConcatMsk${}_{i}$ (Concatenate) | | | | [EncPConv2D${}_{6-i}$[1], DecUpMsk${}_{i}$] |
DecPConv2D${}_{i}$ (PConv2D) | | | | [DecConcatImg${}_{i}$, DecConcatMsk${}_{i}$] |
DecBN${}_{i\ne 7}$ (BatchNorm) | | | | [DecPConv2D${}_{i}$[0]] |
DecLReLU${}_{i}$ (LeakyReLU) | | | | [DecBN${}_{i}$] |
outputs_img (Conv2D) | (256, 256, 1) | 1 | 4 | [DecLReLU${}_{7}$] |
OutCrop (Cropping2D) | (240, 240, 1) | – | 0 | [outputs_img] |
Total params: 65,724,969 | | | | |

**Table A3.** Model summary of LSTM for the IRIS data. The GRU model replaces the LSTM layer with a GRU layer consisting of 240 units and has a total of 347,040 parameters.

Model: LSTM for IRIS Data | | | | |
---|---|---|---|---|
Layer (Type) | Output Shape | Units | Param # | Connected to |
inputs_seq (InputLayer) | [(240, 240)] | – | 0 | [] |
slice${}_{0}$ (SlicingOp) | (180, 240) | – | 0 | [inputs_seq] |
lstm (LSTM) | (180, 240) | 240 | 461760 | [slice${}_{0}$] |
slice${}_{1}$ (SlicingOp) | (60, 240) | – | 0 | [lstm] |
concat (TFOp) | (240, 240) | – | 0 | [slice${}_{0}$, slice${}_{1}$] |
Total params: 461,760 | | | | |

**Table A4.** Model summary for ED-LSTM for the IRIS data. The ED-GRU model replaces the LSTM layers with GRU layers, each consisting of 100 units. The total number of parameters in the model is 164,100.

Model: ED-LSTM for IRIS Data | | | | |
---|---|---|---|---|
Layer (Type) | Output Shape | Units | Param # | Connected to |
inputs_seq (InputLayer) | [(240, 240)] | – | 0 | [] |
slice (SlicingOp) | (180, 240) | – | 0 | [inputs_seq] |
lstm${}_{0}$ (LSTM) | [(180, 100), (100)] | 100 | 136400 | [slice] |
lstm${}_{1}$ (LSTM) | [(100), (100)] | 100 | 80400 | [lstm${}_{0}$[0]] |
repeat_vector (RepeatVector) | (60, 100) | – | 0 | [lstm${}_{1}$[0]] |
lstm${}_{2}$ (LSTM) | (60, 100) | 100 | 80400 | [repeat_vector, lstm${}_{0}$[0][2]] |
lstm${}_{3}$ (LSTM) | (60, 100) | 100 | 80400 | [lstm${}_{2}$[0], lstm${}_{1}$[1]] |
time_distributed (TimeDistributed) | (60, 240) | 240 | 24240 | [lstm${}_{3}$] |
concat (TFOp) | (240, 240) | – | 0 | [slice, time_distributed] |
Total params: 401,840 | | | | |
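The "Param #" entries of Tables A3 and A4 can be cross-checked with the standard closed-form parameter counts of recurrent and dense layers. The sketch below is only a sanity check, and the helper names are hypothetical. The GRU formula assumes the Keras default `reset_after=True`, which uses two bias vectors per gate. This is an assumption about the implementation, although it does reproduce the 347,040 parameters quoted for the GRU model.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and one bias vector."""
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim, units, reset_after=True):
    """3 gates; with reset_after=True (assumed Keras default) there are 2 bias vectors."""
    biases = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + biases)


def dense_params(input_dim, units):
    """Weights plus biases of a dense layer (applied per time step when time-distributed)."""
    return input_dim * units + units


# Table A3: one LSTM (or GRU) layer with 240 units on 240-dimensional inputs.
print(lstm_params(240, 240))
print(gru_params(240, 240))

# Table A4: lstm_0 sees 240 inputs, lstm_1 to lstm_3 see 100 inputs,
# and the time-distributed dense layer maps 100 features to 240.
ed_lstm_total = (
    lstm_params(240, 100)
    + 3 * lstm_params(100, 100)
    + dense_params(100, 240)
)
print(ed_lstm_total)
```

These counts depend only on the input dimension and the number of units, not on the number of time steps, which is why the 180-step encoder and 60-step decoder layers of the ED-LSTM share the same per-layer figure of 80,400.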

## Appendix C. Results

**Figure A1.** Detailed MTS metrics evaluation on the test set for the direct prediction setup. The evaluations are given for each solar activity: the first row of results is for QS activity, the second row is for AR, and the last row is for FL.

**Figure A2.** Detailed MTS metrics evaluation on the test set for the iterated prediction setup. The evaluations are given for each solar activity: the first row of results is for QS activity, the second row is for AR, and the last row is for FL.

**Figure A3.** Confusion matrices for the prediction of centroids on IRIS data, for the direct procedure. We used the 53 centroids from [55]. Each row of results corresponds to a model. Columns are organized by data labels: the first column gives the global aggregate results for QS, AR, and FL data; the other columns present the results for each label, taken separately. Each confusion matrix gives results in terms of joint probability distribution values between the genuine and the predicted. Probability values are displayed with color maps, where violet is the lowest probability and yellow is the highest.

**Figure A4.** Confusion matrices for the prediction of centroids on IRIS data, for the iterated procedure. We used the 53 centroids from [55]. Each row of results corresponds to a model. Columns are organized by data labels: the first column gives the global aggregate results for QS, AR, and FL data; the other columns present the results for each label, taken separately. Each confusion matrix gives results in terms of joint probability distribution values between the genuine and the predicted. Probability values are displayed with color maps, where violet is the lowest probability and yellow is the highest.

## References

- Gangopadhyay, T.; Tan, S.Y.; Jiang, Z.; Meng, R.; Sarkar, S. Spatiotemporal Attention for Multivariate Time Series Prediction and Interpretation. arXiv
**2020**, arXiv:2008.04882. [Google Scholar] - Flunkert, V.; Salinas, D.; Gasthaus, J. DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. arXiv
**2017**, arXiv:1704.04110. [Google Scholar] - Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv
**2019**, arXiv:1905.10437. [Google Scholar] - Liu, G.; Reda, F.A.; Shih, K.J.; Wang, T.; Tao, A.; Catanzaro, B. Image Inpainting for Irregular Holes Using Partial Convolutions. arXiv
**2018**, arXiv:1804.07723. [Google Scholar] - Rumelhart, D.E.; McClelland, J.L. Learning Internal Representations by Error Propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations; The MIT Press: Cambridge, MA, USA, 1987; Volume 1, pp. 318–362. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput.
**1997**, 9, 1735–1780. [Google Scholar] [CrossRef] - Dobson, A. The Oxford Dictionary of Statistical Terms; Oxford University Press: Oxford, UK, 2003; p. 506. [Google Scholar]
- Kendall, M. Time Series; Charles Griffin and Co Ltd.: London, UK; High Wycombe, UK, 1976. [Google Scholar]
- West, M. Time Series Decomposition. Biometrika
**1997**, 84, 489–494. [Google Scholar] [CrossRef] - Sheather, S. A Modern Approach to Regression with R; Springer: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
- Molugaram, K.; Rao, G.S. Chapter 12—Analysis of Time Series. In Statistical Techniques for Transportation Engineering; Molugaram, K., Rao, G.S., Eds.; Butterworth-Heinemann: Oxford, UK, 2017; pp. 463–489. [Google Scholar] [CrossRef]
- Gardner, E.S. Exponential smoothing: The state of the art. J. Forecast.
**1985**, 4, 1–28. [Google Scholar] [CrossRef] - Box, G.; Jenkins, G.M. Time Series Analysis: Forecasting and Control; Holden-Day: Cleveland, Australia, 1976. [Google Scholar]
- Curry, H.B. The method of steepest descent for nonlinear minimization problems. Quart. Appl. Math.
**1944**, 2, 258–261. [Google Scholar] [CrossRef] - Cho, K.; van Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. arXiv
**2014**, arXiv:1409.1259. [Google Scholar] - Kazemi, S.M.; Goel, R.; Eghbali, S.; Ramanan, J.; Sahota, J.; Thakur, S.; Wu, S.; Smyth, C.; Poupart, P.; Brubaker, M. Time2Vec: Learning a Vector Representation of Time. arXiv
**2019**, arXiv:1907.05321. [Google Scholar] [CrossRef] - Lim, B.; Arik, S.O.; Loeff, N.; Pfister, T. Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting. arXiv
**2019**, arXiv:1912.09363. [Google Scholar] [CrossRef] - Grigsby, J.; Wang, Z.; Qi, Y. Long-Range Transformers for Dynamic Spatiotemporal Forecasting. arXiv
**2021**, arXiv:2109.12218. [Google Scholar] - Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv
**2021**, arXiv:1706.03762. [Google Scholar] - Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. arXiv
**2020**, arXiv:2012.07436. [Google Scholar] [CrossRef] - Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The Graph Neural Network Model. IEEE Trans. Neural Netw.
**2009**, 20, 61–80.
- Bertalmio, M.; Sapiro, G.; Caselles, V.; Ballester, C. Image Inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, 23–28 July 2000; ACM Press/Addison-Wesley Publishing Co.: Boston, MA, USA, 2000. SIGGRAPH ’00. pp. 417–424.
- Teterwak, P.; Sarna, A.; Krishnan, D.; Maschinot, A.; Belanger, D.; Liu, C.; Freeman, W.T. Boundless: Generative Adversarial Networks for Image Extension. arXiv **2019**, arXiv:1908.07007.
- Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Deep Image Prior. Int. J. Comput. Vis. **2020**, 128, 1867–1888.
- Dama, F.; Sinoquet, C. Time Series Analysis and Modeling to Forecast: A Survey. arXiv **2021**, arXiv:2104.00164.
- Tessoni, V.; Amoretti, M. Advanced statistical and machine learning methods for multi-step multivariate time series forecasting in predictive maintenance. Procedia Comput. Sci. **2022**, 200, 748–757.
- Lehtinen, J.; Munkberg, J.; Hasselgren, J.; Laine, S.; Karras, T.; Aittala, M.; Aila, T. Noise2Noise: Learning Image Restoration without Clean Data. arXiv **2018**, arXiv:1803.04189.
- Gatys, L.A.; Ecker, A.S.; Bethge, M. Image Style Transfer Using Convolutional Neural Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2414–2423.
- Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. arXiv **2016**, arXiv:1603.08155.
- Ledig, C.; Theis, L.; Huszar, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. arXiv **2016**, arXiv:1609.04802.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv **2015**, arXiv:1505.04597.
- Kramer, M.A. Nonlinear principal component analysis using autoassociative neural networks. AIChE J. **1991**, 37, 233–243.
- Tishby, N.; Zaslavsky, N. Deep Learning and the Information Bottleneck Principle. arXiv **2015**, arXiv:1503.02406.
- Costa, J.; Costa, A.; Kenda, K.; Costa, J.P. Entropy for Time Series Forecasting. In Proceedings of the Slovenian KDD Conference, Ljubljana, Slovenia, 4 October 2021; Available online: https://ailab.ijs.si/dunja/SiKDD2021/Papers/Costaetal_2.pdf (accessed on 20 February 2023).
- Zapart, C.A. Forecasting with Entropy. In Proceedings of the Econophysics Colloquium, Taipei, Taiwan, 4–6 November 2010; Available online: https://www.phys.sinica.edu.tw/~socioecono/econophysics2010/pdfs/ZapartPaper.pdf (accessed on 20 February 2023).
- Xu, D.; Fekri, F. Time Series Prediction Via Recurrent Neural Networks with the Information Bottleneck Principle. In Proceedings of the 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Kalamata, Greece, 25–28 June 2018; pp. 1–5.
- Ponce-Flores, M.; Frausto-Solís, J.; Santamaría-Bonfil, G.; Pérez-Ortega, J.; González-Barbosa, J.J. Time Series Complexities and Their Relationship to Forecasting Performance. Entropy **2020**, 22, 89.
- Zaidi, A.; Estella-Aguerri, I.; Shamai (Shitz), S. On the Information Bottleneck Problems: Models, Connections, Applications and Information Theoretic Views. Entropy **2020**, 22, 151.
- Voloshynovskiy, S.; Kondah, M.; Rezaeifar, S.; Taran, O.; Holotyak, T.; Rezende, D.J. Information bottleneck through variational glasses. arXiv **2019**, arXiv:1912.00830.
- Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep Variational Information Bottleneck. arXiv **2016**, arXiv:1612.00410.
- Ullmann, D.; Rezaeifar, S.; Taran, O.; Holotyak, T.; Panos, B.; Voloshynovskiy, S. Information Bottleneck Classification in Extremely Distributed Systems. Entropy **2020**, 22, 237.
- Geiger, B.C.; Kubin, G. Information Bottleneck: Theory and Applications in Deep Learning. Entropy **2020**, 22, 1408.
- Lee, S.; Jo, J. Information Flows of Diverse Autoencoders. Entropy **2021**, 23, 862.
- Tapia, N.I.; Estévez, P.A. On the Information Plane of Autoencoders. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8.
- Zarcone, R.; Paiton, D.; Anderson, A.; Engel, J.; Wong, H.P.; Olshausen, B. Joint Source-Channel Coding with Neural Networks for Analog Data Compression and Storage. In Proceedings of the 2018 Data Compression Conference, Snowbird, UT, USA, 27–30 March 2018; pp. 147–156.
- Boquet, G.; Macias, E.; Morell, A.; Serrano, J.; Vicario, J.L. Theoretical Tuning of the Autoencoder Bottleneck Layer Dimension: A Mutual Information-based Algorithm. In Proceedings of the 2020 28th European Signal Processing Conference (EUSIPCO), Amsterdam, The Netherlands, 18–21 January 2021; pp. 1512–1516.
- Voloshynovskiy, S.; Taran, O.; Kondah, M.; Holotyak, T.; Rezende, D. Variational Information Bottleneck for Semi-Supervised Classification. Entropy **2020**, 22, 943.
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. **1948**, 27, 379–423.
- Barnes, G.; Leka, K.D.; Schrijver, C.J.; Colak, T.; Qahwaji, R.; Ashamari, O.W.; Yuan, Y.; Zhang, J.; McAteer, R.T.J.; Bloomfield, D.S.; et al. A comparison of flare forecasting methods. Astrophys. J. **2016**, 829, 89.
- Guennou, C.; Pariat, E.; Leake, J.E.; Vilmer, N. Testing predictors of eruptivity using parametric flux emergence simulations. J. Space Weather Space Clim. **2017**, 7, A17.
- Benvenuto, F.; Piana, M.; Campi, C.; Massone, A.M. A Hybrid Supervised/Unsupervised Machine Learning Approach to Solar Flare Prediction. Astrophys. J. **2018**, 853, 90.
- Florios, K.; Kontogiannis, I.; Park, S.H.; Guerra, J.A.; Benvenuto, F.; Bloomfield, D.S.; Georgoulis, M.K. Forecasting Solar Flares Using Magnetogram-based Predictors and Machine Learning. Sol. Phys. **2018**, 293, 28.
- Kontogiannis, I.; Georgoulis, M.K.; Park, S.H.; Guerra, J.A. Testing and Improving a Set of Morphological Predictors of Flaring Activity. Sol. Phys. **2018**, 293, 96.
- Ullmann, D.; Voloshynovskiy, S.; Kleint, L.; Krucker, S.; Melchior, M.; Huwyler, C.; Panos, B. DCT-Tensor-Net for Solar Flares Detection on IRIS Data. In Proceedings of the 2018 7th European Workshop on Visual Information Processing (EUVIP), Tampere, Finland, 26–28 November 2018; pp. 1–6.
- Panos, B.; Kleint, L.; Huwyler, C.; Krucker, S.; Melchior, M.; Ullmann, D.; Voloshynovskiy, S. Identifying Typical Mg ii Flare Spectra Using Machine Learning. Astrophys. J. **2018**, 861, 62.
- Murray, S.A.; Bingham, S.; Sharpe, M.; Jackson, D.R. Flare forecasting at the Met Office Space Weather Operations Centre. Space Weather **2017**, 15, 577–588.
- Sharpe, M.A.; Murray, S.A. Verification of Space Weather Forecasts Issued by the Met Office Space Weather Operations Centre. Space Weather **2017**, 15, 1383–1395.
- Chen, Y.; Manchester, W.B.; Hero, A.O.; Toth, G.; DuFumier, B.; Zhou, T.; Wang, X.; Zhu, H.; Sun, Z.; Gombosi, T.I. Identifying Solar Flare Precursors Using Time Series of SDO/HMI Images and SHARP Parameters. arXiv **2019**, arXiv:1904.00125.
- Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Graph Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. arXiv **2017**, arXiv:1707.01926.
- Yu, B.; Yin, H.; Zhu, Z. Spatio-temporal Graph Convolutional Neural Network: A Deep Learning Framework for Traffic Forecasting. arXiv **2017**, arXiv:1709.04875.
- Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Free-Form Image Inpainting with Gated Convolution. arXiv **2018**, arXiv:1806.03589.
- Gatys, L.A.; Ecker, A.S.; Bethge, M. A Neural Algorithm of Artistic Style. arXiv **2015**, arXiv:1508.06576.
- Wang, C.; Xu, C.; Wang, C.; Tao, D. Perceptual Adversarial Networks for Image-to-Image Transformation. IEEE Trans. Image Process. **2018**, 27, 4066–4079.
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv **2014**, arXiv:1409.1556.
- Kobyzev, I.; Prince, S.J.; Brubaker, M.A. Normalizing Flows: An Introduction and Review of Current Methods. IEEE Trans. Pattern Anal. Mach. Intell. **2021**, 43, 3964–3979.
- Bao, H.; Dong, L.; Piao, S.; Wei, F. BEiT: BERT Pre-Training of Image Transformers. arXiv **2022**, arXiv:2106.08254.
- Xie, Z.; Zhang, Z.; Cao, Y.; Lin, Y.; Bao, J.; Yao, Z.; Dai, Q.; Hu, H. SimMIM: A Simple Framework for Masked Image Modeling. arXiv **2022**, arXiv:2111.09886.
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. arXiv **2021**, arXiv:2111.06377.
- Pontieu, B.D.; Lemen, J. IRIS Technical Note 1: IRIS Operations; Version 17; LMSAL, NASA: Washington, DC, USA, 2013.
- LMSAL. A User’s Guide to IRIS Data Retrieval, Reduction & Analysis; Release 1.0; LMSAL, NASA: Washington, DC, USA, 2019.
- Gošic, M.; Dalda, A.S.; Chintzoglou, G. Optically Thick Diagnostics; Release 1.0 ed.; LMSAL, NASA: Washington, DC, USA, 2018.
- Panos, B.; Kleint, L. Real-time Flare Prediction Based on Distinctions between Flaring and Non-flaring Active Region Spectra. Astrophys. J. **2020**, 891, 17.
- Gherrity, M. A learning algorithm for analog, fully recurrent neural networks. In Proceedings of the International 1989 Joint Conference on Neural Networks, Washington, DC, USA, 16–18 October 1989; Volume 1, pp. 643–644.
- Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In Proceedings of the International Conference on Learning Representations (ICLR ’18), Vancouver, BC, Canada, 30 April–3 May 2018.
- California, S.o. Performance Measurement System (PeMS) Data Source. Available online: https://pems.dot.ca.gov/ (accessed on 20 February 2023).
- Hanssen, A.; Kuipers, W. On the relationship between the frequency of rain and various meteorological parameters. Meded. En Verh. **1965**, 81, 3–15. Available online: https://cdn.knmi.nl/knmi/pdf/bibliotheek/knmipubmetnummer/knmipub102-81.pdf (accessed on 20 February 2023).
- Heidke, P. Berechnung des Erfolges und der Gute der Windstarkevorhersagen im Sturmwarnungsdienst (Measures of success and goodness of wind force forecasts by the gale-warning service). Geogr. Ann. **1926**, 8, 301–349.
- Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. **1960**, 20, 213–220.
- Allouche, O.; Tsoar, A.; Kadmon, R. Assessing the accuracy of species distribution models: Prevalence, kappa and the true skill statistic (TSS). J. Appl. Ecol. **2006**, 43, 1223–1232.
- Liu, M.; Zeng, A.; Chen, M.; Xu, Z.; Lai, Q.; Ma, L.; Xu, Q. SCINet: Time Series Modeling and Forecasting with Sample Convolution and Interaction. arXiv **2022**, arXiv:2106.09305.
- Shao, Z.; Zhang, Z.; Wang, F.; Xu, Y. Pre-Training Enhanced Spatial-Temporal Graph Neural Network for Multivariate Time Series Forecasting. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; Association for Computing Machinery: New York, NY, USA, 2022. KDD ’22. pp. 1567–1577.
- Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv **2014**, arXiv:1406.1078.
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. arXiv **2014**, arXiv:1409.3215.
- Wang, Z.; Bovik, A.; Sheikh, H.; Simoncelli, E. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. **2004**, 13, 600–612.

**Figure 1.** Comparison of Markov chains for a selection of deep TS predictors. The blue parts correspond to the compressed representations of the time dimension. Some of these models may accept additional inputs (correlated context); these are omitted from the diagrams for readability, and the time dimension is compressed in the same way when they are present. A bold $\mathbf{X}$ is used when the model accepts vectors as input.

**Figure 2.** Schematic analogy between the IB principle and image extension. (**Left**) Time prediction under the IB principle, with compression and decoding, using $PConv$ and $DPConv$ layers and skip connections to form a variant of U-Net. (**Right**) Equivalent representation seen as image extension, where the skip connections carry ${\mathbf{X}}_{1:T}$ from the input to the output, and the bottleneck principle allows predicting ${\mathbf{X}}_{T+1:T+F}$ from ${\mathbf{X}}_{1:T}$.

**Figure 3.** Problem formulation: $(x,y)$ represent the spatial coordinates, and $\lambda$ and $t$ represent the spectral and time coordinates, respectively. NASA’s IRIS satellite integrates a mirror from which images or videos of the Sun are captured by a sensor paired with a wavelength filter chosen among $1330\,\mathrm{\AA}$, $1400\,\mathrm{\AA}$, $2796\,\mathrm{\AA}$, and $2832\,\mathrm{\AA}$. This mirror holds a vertical slit from which the diffraction occurs. The $x$ position of the slit can vary in time and is chosen before the observation. A sensor behind the mirror captures the Sun’s spectra for each vertical position $y$ of the Sun’s image, but only at the $x$ position of the slit. We only consider the Mg II h/k data, which lie between ${\lambda}_{1}=2793.8401\,\mathrm{\AA}$ and ${\lambda}_{2}=2806.02\,\mathrm{\AA}$, and we consider all available time sequences.

**Figure 5.** Structure of the ED-LSTM and ED-GRU models used for comparison. ${\mathbf{C}}_{t}^{i}$ represents the hidden state vectors for GRU cells, combined with cell state vectors for LSTM cells.

**Figure 6.** Evaluations performed on the proposed time predictor: center assignments, activity classification, and physical features. Classical MTS and CV evaluations were also performed but are omitted from this diagram for readability.

**Figure 7.** Evaluation of predictions for one flaring (FL) sample performed by the proposed IB-MTS model. **First row**: the masked input, the predicted output, the genuine data, and the magnified pixel-wise error between the predicted and genuine data. **Second row**: spectral center distribution for the prior, the predicted, and the genuine MTS. **Third row**: MTS evaluation of the prediction. **Last twelve plots**: evaluation of astrophysical features; the dotted blue lines represent the genuine data and the green lines represent the prediction.

**Figure 8.** Prediction results. The first column presents the results of the direct predictions (blue part) and the second column presents the iterated predictions (violet part). A masked sample from the original sequence (**first row**), the prediction (**second row**), and the magnified ($\times 5$) differences (**third row**) are shown.
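The difference between the direct and iterated setups compared in Figure 8 can be illustrated with a minimal sketch. It assumes that an iterated forecast re-applies a direct forecaster on a window that slides forward by the forecast length, feeding earlier predictions back in as the new prior. The function names `iterated_forecast` and `toy_forecast` are hypothetical, and the toy forecaster is only a stand-in for a trained model.

```python
from typing import Callable, List, Sequence


def iterated_forecast(
    forecast: Callable[[List[float]], Sequence[float]],
    prior: Sequence[float],
    n_iterations: int,
) -> List[float]:
    """Apply a direct forecaster repeatedly on a sliding window (assumed scheme)."""
    window = list(prior)
    predictions: List[float] = []
    for _ in range(n_iterations):
        future = list(forecast(window))  # direct prediction of the next F steps
        predictions.extend(future)
        # Slide the window: drop the oldest F steps, append the F predicted ones.
        window = window[len(future):] + future
    return predictions


def toy_forecast(window: List[float]) -> List[float]:
    """Hypothetical stand-in for a trained model: continues a linear trend for F = 2 steps."""
    slope = window[-1] - window[-2]
    return [window[-1] + slope, window[-1] + 2 * slope]


prior = [1.0, 2.0, 3.0, 4.0]
direct = toy_forecast(prior)                                   # one pass: F steps
iterated = iterated_forecast(toy_forecast, prior, n_iterations=3)  # 3 * F steps
print(direct, iterated)
```

In the iterated setup, every pass after the first conditions on predicted rather than genuine values, so errors can accumulate with the horizon.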

**Figure 9.** MTS metrics evaluation averaged on the test set for the direct prediction setups on QS, AR, and FL IRIS data.

**Figure 10.** MTS metrics evaluation averaged on the test set for the iterated prediction setups on QS, AR, and FL IRIS data.

**Figure 12.** CV evaluation (over time) of the forecast for the direct and iterated predictions on IRIS data.

**Figure 13.** Average distributions of centroids with their standard deviations as vertical gray error bars. The left graph shows the average prior central data, the middle graph shows the average genuine target, and the right graph shows the average distribution of predictions performed with IB-MTS.

**Figure 16.** Evaluation of the relative prediction errors for physical features over time of the forecasts for IRIS data and the direct setup. Lower is better.

**Figure 17.** Evaluation of the relative prediction errors for physical features over time of the forecasts for IRIS data and the iterated setup.

| | IB-MTS | LSTM | ED-LSTM | GRU | ED-GRU | N-BEATS |
|---|---|---|---|---|---|---|
| Parameters | 65,714,097 | 461,760 | 401,840 | 347,040 | 308,640 | 100,800 |
| Trained step (ms/sample) | 91 | 83 | 98 | 87 | 98 | 424 |

| Data | Setup | Metric | IB-MTS | LSTM | ED-LSTM | GRU | ED-GRU | N-BEATS |
|---|---|---|---|---|---|---|---|---|
| IRIS | direct | MAE | $\mathbf{0.04}$ | $0.05$ | $0.05$ | $0.05$ | $\mathbf{0.04}$ | $0.10$ |
| | | MAPE | $\mathbf{2.76}$ | $13.13$ | $4.71$ | $26.84$ | $3.16$ | $4.75$ |
| | | RMSE | $\mathbf{0.07}$ | $0.08$ | $0.08$ | $0.08$ | $\mathbf{0.07}$ | $0.19$ |
| | iterated | MAE | $\mathbf{0.05}$ | $0.06$ | $\mathbf{0.05}$ | $0.06$ | $\mathbf{0.05}$ | $0.13$ |
| | | MAPE | $\mathbf{2.94}$ | $12.35$ | $3.53$ | $26.32$ | $3.22$ | $6.47$ |
| | | RMSE | $\mathbf{0.08}$ | $0.09$ | $0.09$ | $0.10$ | $\mathbf{0.08}$ | $0.22$ |
| AL | direct | MAE | $\mathbf{0.08}$ | $0.10$ | $\mathbf{0.08}$ | $0.10$ | $0.09$ | $0.11$ |
| | | MAPE | $\mathbf{3.71}$ | $5.27$ | $4.58$ | $5.50$ | $5.20$ | $6.56$ |
| | | RMSE | $0.15$ | $0.16$ | $\mathbf{0.14}$ | $0.16$ | $0.16$ | $0.18$ |
| | iterated | MAE | $\mathbf{0.08}$ | $0.19$ | $0.16$ | $0.23$ | $0.16$ | $0.11$ |
| | | MAPE | $\mathbf{4.00}$ | $11.37$ | $9.10$ | $12.94$ | $9.28$ | $6.23$ |
| | | RMSE | $\mathbf{0.15}$ | $0.26$ | $0.23$ | $0.30$ | $0.23$ | $0.18$ |
| PB | direct | MAE | $\mathbf{0.19}$ | $0.46$ | $0.46$ | $0.50$ | $0.46$ | $0.22$ |
| | | MAPE | $\mathbf{4.47}$ | $10.15$ | $10.03$ | $10.76$ | $10.14$ | $5.19$ |
| | | RMSE | $\mathbf{0.24}$ | $0.54$ | $0.51$ | $0.60$ | $0.52$ | $0.28$ |
| | iterated | MAE | $0.24$ | $0.45$ | $0.45$ | $0.45$ | $0.45$ | $\mathbf{0.23}$ |
| | | MAPE | $\mathbf{4.23}$ | $10.00$ | $9.98$ | $10.01$ | $9.98$ | $5.51$ |
| | | RMSE | $0.30$ | $0.51$ | $0.50$ | $0.51$ | $0.50$ | $\mathbf{0.28}$ |
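For readers who want to reproduce the kind of MAE, MAPE, and RMSE scores reported above, the following is a minimal sketch using the textbook definitions of these metrics. The exact averaging over samples, features, and time steps used for the table is not specified here, so flattening everything into one sequence is an assumption, as is the `eps` guard in `mape`.

```python
import math
from typing import Sequence


def mae(genuine: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(g - p) for g, p in zip(genuine, predicted)) / len(genuine)


def mape(genuine: Sequence[float], predicted: Sequence[float], eps: float = 1e-8) -> float:
    """Mean absolute percentage error, in percent.

    eps is an assumed guard against division by zero, not part of the definition.
    """
    total = sum(abs(g - p) / max(abs(g), eps) for g, p in zip(genuine, predicted))
    return 100.0 * total / len(genuine)


def rmse(genuine: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(genuine, predicted)) / len(genuine))


# Toy values (not taken from any dataset in this paper).
genuine = [0.50, 0.40, 0.80, 1.00]
predicted = [0.45, 0.50, 0.80, 0.90]
print(mae(genuine, predicted), mape(genuine, predicted), rmse(genuine, predicted))
```

RMSE penalizes large deviations more strongly than MAE, while MAPE is sensitive to genuine values close to zero, which is one reason the three metrics can rank models differently.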

| Dataset | Setup | Metric | IB-MTS | LSTM | ED-LSTM | GRU | ED-GRU | N-BEATS |
|---|---|---|---|---|---|---|---|---|
| IRIS | direct | PSNR | $27.2$ | $25.6$ | $26.3$ | $25.9$ | $26.7$ | $14.6$ |
| | | SSIM | $\mathbf{0.897}$ | $0.869$ | $0.887$ | $0.864$ | $0.891$ | $0.673$ |
| | iterated | PSNR | $\mathbf{23.8}$ | $23.0$ | $23.4$ | $22.8$ | $\mathbf{23.8}$ | $13.4$ |
| | | SSIM | $\mathbf{0.868}$ | $0.821$ | $0.864$ | $0.809$ | $\mathbf{0.868}$ | $0.586$ |
| AL | direct | PSNR | $17.2$ | $16.0$ | $\mathbf{17.4}$ | $15.9$ | $16.4$ | $15.7$ |
| | | SSIM | $\mathbf{0.518}$ | $0.400$ | $0.488$ | $0.377$ | $0.401$ | $0.346$ |
| | iterated | PSNR | $\mathbf{16.7}$ | $11.8$ | $13.0$ | $10.6$ | $13.1$ | $15.2$ |
| | | SSIM | $\mathbf{0.516}$ | $0.046$ | $0.198$ | $0.023$ | $0.166$ | $0.361$ |
| PB | direct | PSNR | $\mathbf{12.5}$ | $5.4$ | $6.0$ | $4.5$ | $5.7$ | $11.4$ |
| | | SSIM | $0.361$ | $0.013$ | $0.000$ | $0.004$ | $0.000$ | $\mathbf{0.470}$ |
| | iterated | PSNR | $10.5$ | $5.9$ | $6.0$ | $5.9$ | $6.0$ | $\mathbf{11.0}$ |
| | | SSIM | $0.235$ | $0.003$ | $0.003$ | $0.003$ | $0.004$ | $\mathbf{0.472}$ |
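As an illustration of the PSNR scores above, here is a minimal sketch of the standard definition, $\mathrm{PSNR}=10\log_{10}(R^{2}/\mathrm{MSE})$, applied to flattened pseudo-images. The data range $R=1$ is an assumption (it would fit outputs normalized to $[0,1]$), and the function name `psnr` is illustrative. SSIM is left out because it needs local windowed statistics that do not fit a short sketch.

```python
import math
from typing import Sequence


def psnr(genuine: Sequence[float], predicted: Sequence[float], data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for flattened images.

    data_range is the assumed dynamic range R of the pixel values.
    """
    mse = sum((g - p) ** 2 for g, p in zip(genuine, predicted)) / len(genuine)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


# Toy flattened image with a uniform absolute error of 0.1 per pixel.
genuine = [0.0, 0.5, 1.0, 0.5]
predicted = [0.1, 0.4, 0.9, 0.6]
print(psnr(genuine, predicted))
```

Because PSNR is logarithmic in the mean squared error, a gap of a few dB between two models corresponds to a substantial ratio between their squared errors.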

**Table 4.** Information comparison between the prior centroids ${c}_{0}$, the genuine centroids ${c}_{1}$, and the predicted centroids ${c}_{2}$ for the IRIS dataset. The results for $H$, $KL$, and $I$ are averaged over all testing samples. $H({c}_{0})$, $KL({c}_{0}\parallel{c}_{1})$, and $H({c}_{1})$ are statistics of the prior and the genuine data and therefore do not depend on the method.

| Dataset | Setup | Metric | IB-MTS | LSTM | ED-LSTM | GRU | ED-GRU | N-BEATS |
|---|---|---|---|---|---|---|---|---|
| IRIS | direct | $H({c}_{0})$ | $3.922$ | $3.922$ | $3.922$ | $3.922$ | $3.922$ | $3.922$ |
| | | $KL({c}_{0}\parallel{c}_{1})$ | $0.005$ | $0.005$ | $0.005$ | $0.005$ | $0.005$ | $0.005$ |
| | | $KL({c}_{0}\parallel{c}_{2})$ | $0.111$ | $0.311$ | $0.111$ | $0.344$ | $0.198$ | $0.000$ |
| | | $H({c}_{1})$ | $3.890$ | $3.890$ | $3.890$ | $3.890$ | $3.890$ | $3.890$ |
| | | $H({c}_{2})$ | $3.567$ | $3.382$ | $3.579$ | $3.280$ | $3.465$ | $0.000$ |
| | | $I({c}_{1};{c}_{2})$ | $\mathbf{1.753}$ | $1.607$ | $1.613$ | $1.597$ | $1.681$ | $0.000$ |
| | iterated | $H({c}_{0})$ | $3.968$ | $3.968$ | $3.968$ | $3.968$ | $3.968$ | $3.968$ |
| | | $KL({c}_{0}\parallel{c}_{1})$ | $0.003$ | $0.003$ | $0.003$ | $0.003$ | $0.003$ | $0.003$ |
| | | $KL({c}_{0}\parallel{c}_{2})$ | $\mathbf{0.288}$ | $0.575$ | $0.308$ | $0.416$ | $0.486$ | $0.000$ |
| | | $H({c}_{1})$ | $3.957$ | $3.957$ | $3.957$ | $3.957$ | $3.957$ | $3.957$ |
| | | $H({c}_{2})$ | $3.487$ | $3.325$ | $3.385$ | $3.301$ | $3.229$ | $0.000$ |
| | | $I({c}_{1};{c}_{2})$ | $\mathbf{1.352}$ | $1.276$ | $1.319$ | $1.219$ | $1.314$ | $0.000$ |
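The quantities in Table 4 can be estimated from sequences of centroid labels with plug-in (frequency-count) estimators. The sketch below is one way to do this under stated assumptions: labels are integers in `range(n_groups)`, logarithms are natural (the unit used for the table is not stated in this section), and the `eps` guard for empty bins in the KL divergence is an illustrative choice. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def distribution(labels: Sequence[int], n_groups: int) -> List[float]:
    """Empirical distribution of centroid labels over n_groups groups."""
    counts = Counter(labels)
    return [counts.get(g, 0) / len(labels) for g in range(n_groups)]


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; eps is an assumed guard for empty bins of q."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(a: Sequence[int], b: Sequence[int]) -> float:
    """Plug-in estimate of I(a; b) in nats from paired label sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


# Toy label sequences (not taken from the IRIS data).
c1 = [0, 0, 1, 2]          # genuine centroids
c2 = [0, 1, 1, 2]          # predicted centroids
p1, p2 = distribution(c1, 3), distribution(c2, 3)
print(entropy(p1), entropy(p2), kl_divergence(p1, p2), mutual_information(c1, c2))
```

A predictor whose outputs collapse onto a single centroid yields $H({c}_{2})=0$ and $I({c}_{1};{c}_{2})=0$ under these estimators, since a constant label sequence carries no information about the genuine one.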

**Table 5.** Evaluation on IRIS data: k-NN percentage accuracies for direct prediction at the sizes of the training data and for iterated prediction using a basic sliding window approach. The random k-NN cluster assignment accuracy is given for comparison and corresponds to the worst that can be expected for each k-NN assignment.

| Setup | Metric | Rand$_{k\text{-NN}}$ | IB-MTS | LSTM | ED-LSTM | GRU | ED-GRU | N-BEATS |
|---|---|---|---|---|---|---|---|---|
| direct | 1-NN | $1.9$ | $\mathbf{55.7}$ | $50.5$ | $51.4$ | $50.8$ | $54.1$ | $0.0$ |
| | 2-NN | $7.5$ | $\mathbf{81.9}$ | $79.5$ | $77.8$ | $79.4$ | $80.5$ | $3.8$ |
| | 3-NN | $16.3$ | $\mathbf{94.3}$ | $93.0$ | $91.8$ | $93.2$ | $93.1$ | $15.8$ |
| | 4-NN | $27.6$ | $\mathbf{97.7}$ | $97.3$ | $96.3$ | $97.4$ | $97.0$ | $19.0$ |
| | 5-NN | $40.3$ | $\mathbf{98.9}$ | $98.8$ | $98.2$ | $\mathbf{98.9}$ | $98.6$ | $19.7$ |
| | 6-NN | $53.2$ | $\mathbf{99.6}$ | $99.5$ | $99.1$ | $99.5$ | $99.4$ | $21.1$ |
| iterated | 1-NN | $1.9$ | $\mathbf{45.8}$ | $42.6$ | $43.6$ | $43.4$ | $45.0$ | $0.0$ |
| | 2-NN | $7.5$ | $\mathbf{73.8}$ | $72.1$ | $70.1$ | $72.0$ | $72.4$ | $3.9$ |
| | 3-NN | $16.3$ | $\mathbf{89.4}$ | $88.9$ | $87.0$ | $88.3$ | $87.8$ | $16.1$ |
| | 4-NN | $27.6$ | $95.1$ | $\mathbf{95.2}$ | $93.7$ | $94.5$ | $94.4$ | $19.4$ |
| | 5-NN | $40.3$ | $97.6$ | $\mathbf{97.8}$ | $96.8$ | $97.3$ | $97.2$ | $20.2$ |
| | 6-NN | $53.2$ | $98.9$ | $\mathbf{99.0}$ | $98.5$ | $98.6$ | $98.7$ | $21.8$ |
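The random baseline column can be checked numerically. The sketch below is an inference from the tabulated numbers rather than a formula stated in this section: it assumes 53 centroid groups and counts a random assignment as correct when two independent, uniformly drawn sets of $k$ groups share at least one element, which gives $1-\binom{N-k}{k}/\binom{N}{k}$. The function name `random_knn_accuracy` is hypothetical.

```python
from math import comb


def random_knn_accuracy(k: int, n_groups: int = 53) -> float:
    """Chance (in %) that two independent random k-subsets of n_groups intersect.

    n_groups = 53 is an assumption chosen because it reproduces the tabulated values.
    """
    return 100.0 * (1.0 - comb(n_groups - k, k) / comb(n_groups, k))


baseline = [round(random_knn_accuracy(k), 1) for k in range(1, 7)]
print(baseline)  # [1.9, 7.5, 16.3, 27.6, 40.3, 53.2]
```

Under this model the baseline grows quickly with $k$, which explains why all competitive models converge toward similar accuracies at 5-NN and 6-NN: the task becomes easy even for a random assignment.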

| Activity | Metric | IB-MTS | LSTM | ED-LSTM | GRU | ED-GRU | N-BEATS |
|---|---|---|---|---|---|---|---|
| direct | | | | | | | |
| Global | % Accuracy | $\mathbf{55.7}$ | $50.5$ | $51.4$ | $50.8$ | $54.1$ | $0.0$ |
| | TSS | $\mathbf{0.49}$ | $0.43$ | $0.45$ | $0.43$ | $0.47$ | $0.00$ |
| | HSS | $\mathbf{0.50}$ | $0.44$ | $0.45$ | $0.44$ | $0.48$ | $0.00$ |
| QS | % Accuracy | $\mathbf{52.5}$ | $47.6$ | $47.7$ | $48.6$ | $49.0$ | $0.0$ |
| | TSS | $\mathbf{0.26}$ | $0.18$ | $0.21$ | $0.19$ | $0.22$ | $0.00$ |
| | HSS | $\mathbf{0.28}$ | $0.20$ | $0.22$ | $0.20$ | $0.22$ | $0.00$ |
| AR | % Accuracy | $49.5$ | $44.8$ | $45.7$ | $46.0$ | $\mathbf{49.9}$ | $0.0$ |
| | TSS | $\mathbf{0.43}$ | $0.37$ | $0.39$ | $0.39$ | $\mathbf{0.43}$ | $0.00$ |
| | HSS | $\mathbf{0.43}$ | $0.37$ | $0.39$ | $0.39$ | $\mathbf{0.43}$ | $0.00$ |
| FL | % Accuracy | $\mathbf{63.7}$ | $57.3$ | $60.3$ | $56.3$ | $63.5$ | $0.0$ |
| | TSS | $\mathbf{0.58}$ | $0.51$ | $0.53$ | $0.49$ | $0.57$ | $0.00$ |
| | HSS | $\mathbf{0.58}$ | $0.51$ | $0.53$ | $0.49$ | $0.57$ | $0.00$ |
| iterated | | | | | | | |
| Global | % Accuracy | $40.4$ | $36.4$ | $\mathbf{41.8}$ | $37.4$ | $40.0$ | $0.0$ |
| | TSS | $0.33$ | $0.29$ | $\mathbf{0.35}$ | $0.30$ | $0.32$ | $0.00$ |
| | HSS | $0.34$ | $0.29$ | $\mathbf{0.35}$ | $0.30$ | $0.32$ | $0.00$ |
| QS | % Accuracy | $\mathbf{46.3}$ | $43.4$ | $42.7$ | $44.7$ | $43.9$ | $0.0$ |
| | TSS | $\mathbf{0.15}$ | $0.10$ | $0.12$ | $0.10$ | $0.12$ | $0.00$ |
| | HSS | $\mathbf{0.17}$ | $0.11$ | $0.13$ | $0.12$ | $0.14$ | $0.00$ |
| AR | % Accuracy | $37.2$ | $38.2$ | $34.8$ | $39.0$ | $\mathbf{41.3}$ | $0.0$ |
| | TSS | $0.30$ | $0.30$ | $0.26$ | $0.31$ | $\mathbf{0.33}$ | $0.00$ |
| | HSS | $0.30$ | $0.30$ | $0.26$ | $0.31$ | $\mathbf{0.33}$ | $0.00$ |
| FL | % Accuracy | $33.0$ | $24.8$ | $\mathbf{42.6}$ | $26.3$ | $31.8$ | $0.0$ |
| | TSS | $0.24$ | $0.18$ | $\mathbf{0.33}$ | $0.18$ | $0.22$ | $0.00$ |
| | HSS | $0.24$ | $0.17$ | $\mathbf{0.33}$ | $0.17$ | $0.22$ | $0.00$ |

**Table 7.** Accuracy of solar activity classifications for the predicted versus genuine MTS with the direct and iterated prediction setups.

| Activity (count) | Metric | IB-MTS | LSTM | ED-LSTM | GRU | ED-GRU | N-BEATS |
|---|---|---|---|---|---|---|---|
| direct | | | | | | | |
| Global (8000) | % Acc | $\mathbf{95}$ | $\mathbf{95}$ | $\mathbf{95}$ | $\mathbf{95}$ | $\mathbf{95}$ | $88$ |
| | TSS | $\mathbf{0.911}$ | $\mathbf{0.911}$ | $0.906$ | $0.910$ | $0.909$ | $0.805$ |
| | HSS | $\mathbf{0.915}$ | $0.906$ | $0.911$ | $0.905$ | $0.914$ | $0.785$ |
| QS (3680) | % Acc | $\mathbf{97}$ | $96$ | $96$ | $96$ | $96$ | $94$ |
| | TSS | $\mathbf{0.938}$ | $0.911$ | $0.916$ | $0.910$ | $0.918$ | $0.876$ |
| | HSS | $\mathbf{0.936}$ | $0.911$ | $0.915$ | $0.910$ | $0.917$ | $0.874$ |
| AR (536) | % Acc | $96$ | $96$ | $\mathbf{97}$ | $96$ | $\mathbf{97}$ | $91$ |
| | TSS | $\mathbf{0.613}$ | $0.401$ | $0.327$ | $0.400$ | $0.311$ | $0.000$ |
| | HSS | $\mathbf{0.640}$ | $0.371$ | $0.366$ | $0.362$ | $0.349$ | $0.000$ |
| FL (3784) | % Acc | $98$ | $\mathbf{99}$ | $98$ | $\mathbf{99}$ | $\mathbf{99}$ | $92$ |
| | TSS | $0.958$ | $0.971$ | $0.965$ | $\mathbf{0.972}$ | $\mathbf{0.972}$ | $0.838$ |
| | HSS | $0.959$ | $0.971$ | $0.965$ | $\mathbf{0.972}$ | $\mathbf{0.972}$ | $0.843$ |
| iterated | | | | | | | |
| Global (8000) | % Acc | $94$ | $94$ | $93$ | $94$ | $\mathbf{95}$ | $86$ |
| | TSS | $\mathbf{0.979}$ | $0.899$ | $0.870$ | $0.896$ | $0.895$ | $0.768$ |
| | HSS | $0.889$ | $0.891$ | $0.869$ | $0.889$ | $\mathbf{0.901}$ | $0.738$ |
| QS (3680) | % Acc | $\mathbf{96}$ | $95$ | $95$ | $95$ | $95$ | $88$ |
| | TSS | $\mathbf{0.914}$ | $0.903$ | $0.891$ | $0.892$ | $0.907$ | $0.777$ |
| | HSS | $\mathbf{0.915}$ | $0.903$ | $0.893$ | $0.891$ | $0.908$ | $0.767$ |
| AR (536) | % Acc | $94$ | $95$ | $95$ | $\mathbf{96}$ | $\mathbf{96}$ | $92$ |
| | TSS | $\mathbf{0.544}$ | $0.387$ | $0.188$ | $0.399$ | $0.277$ | $0.168$ |
| | HSS | $\mathbf{0.594}$ | $0.331$ | $0.185$ | $0.346$ | $0.311$ | $0.113$ |
| FL (3784) | % Acc | $\mathbf{98}$ | $\mathbf{98}$ | $97$ | $\mathbf{98}$ | $\mathbf{98}$ | $91$ |
| | TSS | $0.948$ | $0.955$ | $0.930$ | $\mathbf{0.960}$ | $0.957$ | $0.932$ |
| | HSS | $0.949$ | $0.957$ | $0.930$ | $\mathbf{0.962}$ | $0.957$ | $0.920$ |
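To make the skill scores in the tables above concrete, the sketch below computes the true skill statistic (TSS) and the Heidke skill score (HSS) from a binary confusion matrix using their standard two-class definitions. Treating each activity class one-vs-rest is an assumption made for illustration; the multiclass aggregation behind the Global rows is not specified in this section. The helper `one_vs_rest_counts` is hypothetical.

```python
from typing import Sequence, Tuple


def one_vs_rest_counts(
    genuine: Sequence[str], predicted: Sequence[str], positive: str
) -> Tuple[int, int, int, int]:
    """Return (TP, FN, FP, TN) when `positive` is treated as the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(genuine, predicted))
    fn = sum(g == positive and p != positive for g, p in zip(genuine, predicted))
    fp = sum(g != positive and p == positive for g, p in zip(genuine, predicted))
    tn = sum(g != positive and p != positive for g, p in zip(genuine, predicted))
    return tp, fn, fp, tn


def tss(tp: int, fn: int, fp: int, tn: int) -> float:
    """True skill statistic: recall minus false alarm rate."""
    return tp / (tp + fn) - fp / (fp + tn)


def hss(tp: int, fn: int, fp: int, tn: int) -> float:
    """Heidke skill score: improvement over the agreement expected by chance."""
    numerator = 2.0 * (tp * tn - fn * fp)
    denominator = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return numerator / denominator


# Toy labels (not taken from the IRIS data): class from genuine vs. predicted MTS.
genuine = ["QS", "QS", "AR", "FL", "FL", "FL"]
predicted = ["QS", "AR", "AR", "FL", "FL", "QS"]
counts = one_vs_rest_counts(genuine, predicted, positive="FL")
print(counts, tss(*counts), hss(*counts))
```

Unlike accuracy, TSS does not depend on the class ratio, which matters for an imbalanced class such as AR (536 of 8000 samples): a high accuracy there can coexist with a low TSS.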

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Ullmann, D.; Taran, O.; Voloshynovskiy, S.
Multivariate Time Series Information Bottleneck. *Entropy* **2023**, *25*, 831.
https://doi.org/10.3390/e25050831
