Multimedia Data Modelling Using Multidimensional Recurrent Neural Networks

Modelling multimedia data such as text, images, or videos usually involves analysing, predicting, or reconstructing them. The recurrent neural network (RNN) is a powerful machine learning approach that models such data recursively. As a variant, the long short-term memory (LSTM) extends the RNN with the ability to remember information for longer. Whilst one can increase the capacity of the LSTM by widening it or adding layers, this usually requires additional parameters and runtime, which can make learning harder. We therefore propose a Tensor LSTM in which the hidden states are tensorised as multidimensional arrays (tensors) and updated through a cross-layer convolution. As parameters are spatially shared within the tensor, we can efficiently widen the model without extra parameters by increasing the tensorised size; as the deep computations of each time step are absorbed by the temporal computations of the time series, we can implicitly deepen the model with little extra runtime by delaying the output. We show by experiments that our model is well-suited for various multimedia data modelling tasks, including text generation, text calculation, image classification, and video prediction.


Introduction
Multimedia data such as text, images, and videos are ubiquitous nowadays. Modelling such data usually involves analysing, predicting, or reconstructing them. For instance, text modelling relates to many natural language processing tasks such as sentiment analysis [1], part-of-speech tagging [2], machine translation [3], and question answering [4]; image modelling relates to many computer vision tasks such as image segmentation [5], depth reconstruction [6], image generation [7], and super-resolution [8]; and video modelling relates to further computer vision tasks such as object tracking [9], video segmentation [10], motion estimation [11], and video prediction [12]. Although they are diverse, these tasks can usually be formulated as a time series prediction problem, e.g., generating a desired output $y_t$ for a given time series $x_{1:t} = \{x_1, x_2, \dots, x_t\}$, for time $t = 1, 2, \dots, T$, where $x_t \in \mathbb{R}^U$ and $y_t \in \mathbb{R}^V$ are vectors (in this paper, we assume that vectors are in row form).
The recurrent neural network (RNN) [13,14] is a popular model that can learn to encapsulate the useful information of the input history $x_{1:t}$ into a hidden state vector $h_t \in \mathbb{R}^M$. By concatenating the input $x_t$ to the previous hidden state $h_{t-1}$, we first get $h^{cat}_{t-1} \in \mathbb{R}^{U+M}$:
$$h^{cat}_{t-1} = [x_t, h_{t-1}] \tag{1}$$
The hidden state $h_t$ is then updated by:
$$a_t = h^{cat}_{t-1} W^h + b^h \tag{2}$$
$$h_t = \phi(a_t) \tag{3}$$
where $W^h \in \mathbb{R}^{(U+M) \times M}$ and $b^h \in \mathbb{R}^M$ are the parameters, namely the weight and the bias, respectively; $a_t \in \mathbb{R}^M$ is the activation for $h_t$; and $\phi(\cdot)$ is the element-wise tanh function. The RNN finally produces an output $y_t$ for time $t$:
$$y_t = \varphi(h_t W^y + b^y) \tag{4}$$
where $W^y \in \mathbb{R}^{M \times V}$ and $b^y \in \mathbb{R}^V$, and $\varphi(\cdot)$ is a differentiable transformation that depends on the task.
Nevertheless, the standard RNN is notoriously poor at capturing long-term dependencies, owing to vanishing and exploding gradients [15]. Long Short-Term Memories (LSTMs) [16,17] mitigate this by (i) introducing a memory cell to store information for longer, and (ii) utilising gates for information routing. In a standard LSTM [17], the hidden state $h_t$ is updated as follows:
$$[a^i_t, a^o_t, a^g_t, a^f_t] = h^{cat}_{t-1} W^h + b^h \tag{5}$$
$$[i_t, o_t, g_t, f_t] = [\sigma(a^i_t), \sigma(a^o_t), \phi(a^g_t), \sigma(a^f_t)] \tag{6}$$
$$c_t = g_t \odot i_t + c_{t-1} \odot f_t \tag{7}$$
$$h_t = \phi(c_t) \odot o_t \tag{8}$$
where $W^h \in \mathbb{R}^{(U+M) \times 4M}$ and $b^h \in \mathbb{R}^{4M}$ are the parameters; $a^i_t, a^o_t, a^g_t, a^f_t \in \mathbb{R}^M$ are respectively the activations of the input gate $i_t$, the output gate $o_t$, the new content $g_t$, and the forget gate $f_t$; $c_t \in \mathbb{R}^M$ is the updated memory cell; $\sigma(\cdot)$ is the element-wise sigmoid function; and $\odot$ is the element-wise multiplication. Since the LSTM is successful in modelling time series, it is natural to further increase its capacity so that it can be profitably applied to a wider range of tasks.
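The standard LSTM update described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function names, the column ordering of the gates inside `W_h` (input gate, output gate, new content, forget gate), and the random initialisation are all assumptions made for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W_h, b_h):
    """One LSTM step for row vectors x_t (U,), h_prev (M,), c_prev (M,).

    W_h has shape (U+M, 4M) and b_h has shape (4M,). The gate order
    [i, o, g, f] inside W_h is an assumption made for this sketch.
    """
    h_cat = np.concatenate([x_t, h_prev])          # concatenated state, (U+M,)
    a = h_cat @ W_h + b_h                          # all activations, (4M,)
    a_i, a_o, a_g, a_f = np.split(a, 4)
    i, o, f = sigmoid(a_i), sigmoid(a_o), sigmoid(a_f)
    g = np.tanh(a_g)                               # new content
    c_t = g * i + c_prev * f                       # updated memory cell
    h_t = np.tanh(c_t) * o                         # updated hidden state
    return h_t, c_t

# Run a short random sequence through the cell.
rng = np.random.default_rng(0)
U, M = 3, 4
W_h = rng.normal(scale=0.1, size=(U + M, 4 * M))
b_h = np.zeros(4 * M)
h, c = np.zeros(M), np.zeros(M)
for x in rng.normal(size=(5, U)):
    h, c = lstm_step(x, h, c, W_h, b_h)
```

Because the output gate lies in (0, 1) and tanh is bounded, every entry of `h` stays strictly inside (-1, 1) regardless of the sequence length.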
We consider the width and the depth to compose a network's capacity, where the former measures how much information can be processed in parallel, while the latter measures how many computation steps are required for processing [18]. Whilst using more hidden units in a layer can widen the LSTM, the parameter number then scales quadratically with the layer size. On the other hand, the Stacked LSTM (sLSTM) deepens the LSTM by using multiple layers [19]; however, the runtime scales linearly with the layer number, and the input information is likely to be lost when it passes vertically through the LSTM layers (because of vanishing/exploding gradients).
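To make the quadratic growth concrete, the following sketch counts the parameters of a single standard LSTM layer directly from the shapes of its weight and bias, $(U+M) \times 4M$ and $4M$. The function name and the chosen sizes are illustrative only.

```python
def lstm_param_count(input_size: int, hidden_size: int) -> int:
    """Parameters of one standard LSTM layer with U inputs and M hidden units."""
    weights = (input_size + hidden_size) * 4 * hidden_size   # W^h: (U+M) x 4M
    biases = 4 * hidden_size                                  # b^h: 4M
    return weights + biases

U = 10
for M in (500, 1000, 2000):
    print(M, lstm_param_count(U, M))

# Doubling the hidden size roughly quadruples the parameter number.
ratio = lstm_param_count(U, 2000) / lstm_param_count(U, 1000)
```

When $M$ is much larger than $U$, the ratio approaches 4, which is the quadratic scaling referred to above.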
The goal of this paper is to make the LSTM wider and deeper while preventing its parameter number and runtime from growing. To summarise, we make the following contributions:

• We represent RNN hidden states as multidimensional arrays (tensors) to allow more flexible parameter sharing, so that the network can be widened efficiently without extra parameters.
• We use the temporal computations of the RNN to absorb its deep computations, so that it can be deepened without extra runtime. We call this novel RNN the Tensor RNN (tRNN).
• We propose a memory cell convolution and apply it to the tRNN in order to mitigate gradient vanishing and explosion, obtaining a Tensor LSTM (tLSTM).
• We generalise the tLSTM so that it can process not only non-structured time series (series of vectors) [20], but also structured time series (series of tensors, such as videos).
• We show by experiments that our model is well-suited for various multimedia data modelling tasks, including text generation, text calculation, image classification, and video prediction.

Tensor Representation
From (2), we can see that the parameter number of an RNN scales quadratically with its hidden size. To widen the network while restricting its parameter number, one can use tensor factorisation, where the parameters are represented by multidimensional tensors that can be factorised into low-rank subtensors containing far fewer elements [21-29]. As the hidden state vector is broadcast when interacting with the parameter tensor, the network is widened implicitly. One can also limit the parameter number of an RNN by spatially sharing a small group of parameters within its hidden state, analogous to the convolutional neural network (CNN) [30,31].
Here, we use parameter sharing to reduce the number of parameters in RNNs. In contrast to tensor factorisation, it provides two benefits: (i) scalability: the size of the hidden state does not affect the number of parameters; (ii) separability: we can carefully route the information via receptive field control so that the deep computations of the RNN can be shifted into its temporal direction (Section 2.2). In addition, the hidden state vectors of the RNN are explicitly tensorised, as tensors are more: (i) flexible: we can choose the dimensions for parameter sharing and then simply enlarge the sizes of these dimensions so that no more parameters are introduced; (ii) efficient: by using tensors of higher dimensionality, we can widen the network faster when the number of parameters is fixed (Section 2.3).
To ease explanation, we first focus on 2D tensors (matrices). Given a hidden state $h_t \in \mathbb{R}^M$, we tensorise it as $H_t \in \mathbb{R}^{P \times M}$, where $P$ and $M$ are respectively the tensorised size and the channel size. In $H_t$, the first dimension is locally connected for parameter sharing, and the second dimension is fully connected for global interaction, as in a CNN where only the last dimension is fully connected so that different feature planes (e.g., the red/green/blue channels of the input image) can be globally fused. In addition, when comparing $H_t$ with the hidden state in the Stacked RNN (sRNN) (as in Figure 1a), $P$ can be thought of as the layer number, and $M$ as the size of each layer. We start with 2D tensors, then demonstrate how to use higher-dimensional tensors to enhance the model, and finally show how to extend the model to deal with structured input time series.

Deep Computation through Time
As an RNN is already deep when unfolded in time, we can associate the input $x_t$ with a future (delayed) output so that the input-to-output computation also becomes deep. To achieve this, we must guarantee that the output $y_t$ is separable, i.e., that it is independent of the future inputs $x_{t+1:T}$. Therefore, we first stack the projection of $x_t$ on top of $H_{t-1}$, then move the input content downwards through the temporal computation, and finally produce $y_t$ from the bottom of a future hidden state $H_{t+L-1}$, where $L-1$ denotes the number of delayed time steps and $L$ denotes the depth. Figure 1b shows an example with $L = 3$, which can be thought of as a skewed sRNN as mentioned in [7,32]. However, in our implementation, the network structure does not need to be changed, and various interactions are also allowed as long as the output satisfies separability. For instance, we can use wider local connections or introduce feedback connections (as in Figure 1c) to improve the model (similar to [33]). In addition, we update $H_t$ by convolving it with a learnable kernel so that the parameters can be shared. In doing so, we increase the complexity of the input-output mapping (via output delay) and limit the growth of the parameter number (via parameter sharing by convolution).
To define the tRNN described above, we denote the concatenated hidden state as $H^{cat}_{t-1} \in \mathbb{R}^{(P+1) \times M}$, and a location in a tensor as $p \in \mathbb{Z}_+$. At location $p$ of $H^{cat}_{t-1}$, the channel vector $h^{cat}_{t-1,p} \in \mathbb{R}^M$ satisfies:
$$h^{cat}_{t-1,p} = \begin{cases} x_t W^x + b^x & \text{if } p = 1 \\ h_{t-1,p-1} & \text{if } p > 1 \end{cases} \tag{9}$$
where $W^x \in \mathbb{R}^{U \times M}$ and $b^x \in \mathbb{R}^M$. The hidden state $H_t$ is then updated through a convolution:
$$A_t = H^{cat}_{t-1} \circledast \{W^h, b^h\} \tag{10}$$
$$H_t = \phi(A_t) \tag{11}$$
where $W^h \in \mathbb{R}^{K \times M^i \times M^o}$ and $b^h \in \mathbb{R}^{M^o}$ are the weight and the bias of the kernel, respectively, with $K$ denoting the kernel size, $M^i = M$ the input channel, and $M^o = M$ the output channel; $A_t \in \mathbb{R}^{P \times M^o}$ denotes the activation of $H_t$; and $\circledast$ denotes the convolution operation (detailed in Appendix A.1). As the kernel convolves across different layers of the hidden state, we call this a cross-layer convolution, which allows interaction among layers (both top-down and bottom-up). Finally, the channel vector $h_{t+L-1,P} \in \mathbb{R}^M$, located at the bottom of $H_{t+L-1}$, is used to generate $y_t$:
$$y_t = \varphi(h_{t+L-1,P} W^y + b^y) \tag{12}$$
where $W^y \in \mathbb{R}^{M \times V}$ and $b^y \in \mathbb{R}^V$. To ensure that the receptive field of $y_t$ only covers the historical inputs $x_{1:t}$ (as in Figure 1c), a constraint among $L$, $P$, and $K$ needs to be satisfied:
$$L = \left\lceil \frac{2P}{K - K \,\%\, 2} \right\rceil \tag{13}$$
where $\%$ is the modulus operator and $\lceil \cdot \rceil$ denotes the ceiling operation. Please see Appendix B for the derivation of (13).
We call the RNN described in (9)-(12) a Tensor RNN (tRNN), where one can increase the tensorised size $P$ to widen the model while keeping the number of parameters fixed (by using convolution). Moreover, unlike the sRNN, whose runtime complexity is $O(TL)$, the tRNN has a runtime complexity of $O(T+L)$, so the costs of the sequence length $T$ and the depth $L$ add rather than multiply.
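The tRNN update and the depth constraint (13) can be illustrated with a small NumPy sketch. Several details here are assumptions made for the example rather than the definition in Appendix A.1: the zero padding (chosen so that the convolution output keeps $P$ locations, with the window of location $p$ centred on $h_{t-1,p}$), the requirement $K \geq 2$, the function names, and the random parameters. The final check perturbs only the input at the second time step and confirms that the bottom location of $H_L$, from which $y_1$ is read, does not change.

```python
import numpy as np

def depth(P, K):
    """Depth L = ceil(2P / (K - K % 2)) from constraint (13), for K >= 2."""
    return -(-2 * P // (K - K % 2))

def trnn_step(x_t, H_prev, W_x, b_x, W_h, b_h):
    """One tRNN step. H_prev: (P, M); W_x: (U, M); W_h: (K, M, M); b_h: (M,)."""
    P, M = H_prev.shape
    K = W_h.shape[0]
    # Stack the projection of x_t on top of H_prev, giving shape (P+1, M).
    H_cat = np.vstack([x_t @ W_x + b_x, H_prev])
    # Assumed zero padding so that the output has exactly P locations.
    top, bottom = K // 2 - 1, (K + 1) // 2 - 1
    H_pad = np.pad(H_cat, ((top, bottom), (0, 0)))
    # Cross-layer convolution: the same kernel is applied at every location.
    A = np.stack([
        np.einsum("ki,kio->o", H_pad[p:p + K], W_h) + b_h
        for p in range(P)
    ])
    return np.tanh(A)

def run(xs, P, M, params):
    H, states = np.zeros((P, M)), []
    for x in xs:
        H = trnn_step(x, H, *params)
        states.append(H)
    return states

rng = np.random.default_rng(0)
U, M, P, K, T = 2, 3, 4, 3, 8
L = depth(P, K)
params = (rng.normal(size=(U, M)), np.zeros(M),
          rng.normal(size=(K, M, M)), np.zeros(M))

xs = rng.normal(size=(T, U))
xs_perturbed = xs.copy()
xs_perturbed[1] += 1.0                     # change the second input only

base = run(xs, P, M, params)
pert = run(xs_perturbed, P, M, params)
# y_1 is read from the bottom location of H_L (list index L - 1).
y1_unaffected = np.array_equal(base[L - 1][-1], pert[L - 1][-1])
```

Under these assumptions the input content moves down by $\lfloor K/2 \rfloor$ locations per time step, which is why a later input cannot reach the bottom location within the $L-1$ delayed steps.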

Using LSTMs
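To capture long-term dependencies across different time steps, the tRNN can be straightforwardly extended with the LSTM by modifying (10) and (11) as:
$$[A^i_t, A^o_t, A^g_t, A^f_t] = H^{cat}_{t-1} \circledast \{W^h, b^h\} \tag{14}$$
$$[I_t, O_t, G_t, F_t] = [\sigma(A^i_t), \sigma(A^o_t), \phi(A^g_t), \sigma(A^f_t)] \tag{15}$$
$$C_t = G_t \odot I_t + C_{t-1} \odot F_t \tag{16}$$
$$H_t = \phi(C_t) \odot O_t \tag{17}$$
where $\{W^h, b^h\}$ is the kernel with kernel size $K$, input channel $M^i = M$, and output channel $M^o = 4M$; $A^i_t, A^o_t, A^g_t, A^f_t \in \mathbb{R}^{P \times M}$ are respectively the activations of the input gate $I_t$, the output gate $O_t$, the new content $G_t$, and the forget gate $F_t$; and $C_t \in \mathbb{R}^{P \times M}$ is the updated memory cell. However, as (16) only gates the previous memory cell $C_{t-1}$ along the temporal direction (as in Figure 1d), the long-term dependency along the input-output direction is likely to be lost when the tensorised size $P$ grows large.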

Memory Cell Convolution
Here, we propose a novel memory cell convolution (memConv) for capturing long-term dependencies from multiple directions, where, like the hidden state, the memory cell can also have a wider receptive field (as in Figure 1e). In addition, the kernel for memConv is generated on the fly and therefore varies with time and location, flexibly controlling the long-term dependencies from different directions. Concretely, we define the tensor update for the tLSTM as follows:
$$[A^i_t, A^o_t, A^g_t, A^f_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h\} \tag{18}$$
$$[I_t, O_t, G_t, F_t, Q_t] = [\sigma(A^i_t), \sigma(A^o_t), \phi(A^g_t), \sigma(A^f_t), \varsigma(A^q_t)] \tag{19}$$
$$W^c_t(p) = \mathrm{reshape}(q_{t,p}, [K, 1, 1]) \tag{20}$$
$$C^{conv}_{t-1} = C_{t-1} \circledast W^c_t(p) \tag{21}$$
$$C_t = G_t \odot I_t + C^{conv}_{t-1} \odot F_t \tag{22}$$
$$H_t = \phi(C_t) \odot O_t \tag{23}$$
where, unlike (14)-(17), the kernel $\{W^h, b^h\}$ contains $\langle K \rangle$ additional output channels ($\langle \cdot \rangle$ computes the cumulative product of the elements of its input) for generating the activation $A^q_t \in \mathbb{R}^{P \times \langle K \rangle}$ of the dynamic kernel bank $Q_t \in \mathbb{R}^{P \times \langle K \rangle}$; $q_{t,p} \in \mathbb{R}^{\langle K \rangle}$ denotes the vectorised dynamic kernel selected from entry $p$ of $Q_t$; and $W^c_t(p) \in \mathbb{R}^{K \times 1 \times 1}$ is the dynamic kernel reshaped from $q_{t,p}$ (illustrated in Figure 2a), with a size of $K$ and a single input/output channel. Equation (21) defines the memConv (detailed in Appendix A.2), where we use $W^c_t(p)$, whose value varies with $p$, to convolve every channel of $C_{t-1}$, producing a convolved memory cell $C^{conv}_{t-1} \in \mathbb{R}^{P \times M}$. Analogous to [34], in (19) a softmax function $\varsigma(\cdot)$ is employed to normalise $Q_t$ along its channel dimension, which can stabilise the memory cell values and thereby mitigate vanishing and exploding gradients (please see Appendix C for more discussion).
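As a rough illustration of the memConv for 2D tensors, the sketch below builds the dynamic kernel bank with a softmax and applies a location-dependent kernel that is shared by all channels of the memory cell. The centred window, the zero padding at both ends, the restriction to odd $K$, and the function names are assumptions made for this example; Appendix A.2 gives the actual definition.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mem_conv(C_prev, A_q):
    """Convolve the memory cell with location-dependent dynamic kernels.

    C_prev: (P, M) previous memory cell.
    A_q:    (P, K) activations of the dynamic kernel bank, K odd (assumed).
    """
    P, M = C_prev.shape
    K = A_q.shape[1]
    Q = softmax(A_q, axis=1)                     # kernel bank, rows sum to 1
    r = K // 2
    C_pad = np.pad(C_prev, ((r, r), (0, 0)))     # assumed zero padding
    # Location p uses its own kernel Q[p], shared by all M channels.
    return np.stack([Q[p] @ C_pad[p:p + K] for p in range(P)])

rng = np.random.default_rng(0)
P, M, K = 5, 2, 3

# With a constant memory cell, interior locations are left unchanged because
# each softmax-normalised kernel sums to one.
C_conv = mem_conv(np.ones((P, M)), rng.normal(size=(P, K)))

# A kernel with almost all weight on its centre tap recovers the plain
# temporal gating of (16), i.e. C_conv is (numerically) C_prev.
C_rand = rng.normal(size=(P, M))
A_centre = np.full((P, K), -50.0)
A_centre[:, K // 2] = 50.0
C_same = mem_conv(C_rand, A_centre)
```

The convolved cell then replaces $C_{t-1}$ in the cell update, e.g. `C_t = G_t * I_t + mem_conv(C_prev, A_q) * F_t`. Since each kernel is a convex combination, the convolved values cannot exceed the range of the values in their window, which is the stabilising effect mentioned above.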
Many works [22,23,27,35-37] use the concept of dynamically producing the model weights, among which [36] also dynamically generates location-dependent convolution kernels to improve the CNN. Unlike these works, we aim to broaden the receptive fields of the tLSTM memory cells. While remaining flexible, the memConv kernel requires few parameters to generate, as it is shared by the different channels of the memory cell.

Channel Normalisation
We adapt the recently proposed layer normalisation (LN) [38] to speed up the training of the tLSTM. In [38], LN was observed to be unsuitable for the CNN, where different channel vectors possess different statistics. Similarly, we have found that LN does not perform well for the tLSTM, in which channel vectors close to the input carry lower-level information and those close to the output carry higher-level information. Therefore, we propose a channel normalisation (CN), which normalises each channel vector independently. The CN operator is defined as:
$$\mathrm{CN}(Z; \Gamma, B) = \hat{Z} \odot \Gamma + B \tag{24}$$
where $\Gamma, B, Z, \hat{Z} \in \mathbb{R}^{P \times M^z}$; $\Gamma$ and $B$ are the parameters, namely the gain and the bias, respectively; $Z$ is the input tensor; and $\hat{Z}$ is the normalised tensor. Let $z_{m^z} \in \mathbb{R}^P$ be the $m^z$-th channel of $Z$; it is normalised element-wise:
$$\hat{z}_{m^z} = \frac{z_{m^z} - z^{\mu}}{z^{\sigma}} \tag{25}$$
where $z^{\mu}, z^{\sigma} \in \mathbb{R}^P$ are respectively the mean and the standard deviation computed along the channel dimension of $Z$, and $\hat{z}_{m^z} \in \mathbb{R}^P$ denotes the $m^z$-th channel of $\hat{Z}$. As the number of parameters introduced by CN/LN is very small compared with the number of model parameters, it can reasonably be neglected.
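A minimal NumPy sketch of the CN operator for a 2D tensor follows. The small constant `eps`, added to the standard deviation for numerical safety, is an assumption of this sketch and is not part of the definition above; the function name is also illustrative.

```python
import numpy as np

def channel_norm(Z, gain, bias, eps=1e-5):
    """Channel normalisation of Z with shape (P, M_z).

    The mean and standard deviation are computed along the channel
    dimension, so each location p is normalised independently.
    """
    mu = Z.mean(axis=1, keepdims=True)        # z^mu, one value per location
    sigma = Z.std(axis=1, keepdims=True)      # z^sigma, one value per location
    Z_hat = (Z - mu) / (sigma + eps)          # eps is an added safeguard
    return Z_hat * gain + bias

rng = np.random.default_rng(0)
P, M_z = 4, 16
Z = rng.normal(loc=3.0, scale=2.0, size=(P, M_z))
Z_hat = channel_norm(Z, np.ones((P, M_z)), np.zeros((P, M_z)))
```

With unit gain and zero bias, every channel vector of `Z_hat` has approximately zero mean and unit standard deviation, whatever the statistics at that location were before.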

Leveraging Higher-Dimensional Tensors
In (13), we can see that, given the kernel size $K$, the tensorised size $P$ scales linearly w.r.t. the depth $L$. To widen the tLSTM more efficiently, we resort to higher-dimensional tensors, whose volume can be expanded more rapidly. Based on the tLSTM defined in the previous sections, we can generalise the tensors from 2D to $(D+1)$-dimensional, where $D > 1$, resulting in $H_t, C_t \in \mathbb{R}^{P_1 \times P_2 \times \dots \times P_D \times M}$ with the tensorised size $\mathbf{P} = [P_1, P_2, \dots, P_D]$. As the hidden states are more than 2D, we instead concatenate the projection of $x_t$ to the corner of $H_{t-1}$, thereby extending (9) as:
$$h^{cat}_{t-1,\mathbf{p}} = \begin{cases} x_t W^x + b^x & \text{if } p_d = 1 \text{ for } d = 1, 2, \dots, D \\ h_{t-1,\mathbf{p}-1} & \text{if } p_d > 1 \text{ for } d = 1, 2, \dots, D \\ 0 & \text{otherwise} \end{cases} \tag{26}$$
where the channel vector $h^{cat}_{t-1,\mathbf{p}} \in \mathbb{R}^M$ is the entry $\mathbf{p} \in \mathbb{Z}^D_+$ of the concatenated hidden state $H^{cat}_{t-1} \in \mathbb{R}^{(P_1+1) \times (P_2+1) \times \dots \times (P_D+1) \times M}$. Accordingly, the output $y_t$ is generated at the opposite corner of $H_{t+L-1}$, thus we modify (12) as:
$$y_t = \varphi(h_{t+L-1,\mathbf{P}} W^y + b^y) \tag{27}$$
To update the hidden state, we also tensorise the convolution kernels $W^h$ and $W^c_t(\cdot)$ so that they have a kernel size of $\mathbf{K} = [K_1, K_2, \dots, K_D]$, where $W^c_t(\cdot)$ is still reshaped from the vector $q_{t,\mathbf{p}}$ (see Figure 2b). In order to make every dimension of $\mathbf{P}$ and $\mathbf{K}$ meet the constraint (13) with the same $L$, we set $P_d = P$ and $K_d = K$ for $d = 1, 2, \dots, D$. CN is still applied to normalise the channel dimension of the tensors.
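For $D = 2$, the corner concatenation can be sketched as below: the projected input occupies the corner entry, the previous hidden state is shifted by one location along every tensorised dimension, and the remaining border entries are assumed to be zero. The function name is illustrative.

```python
import numpy as np

def concat_corner(x_proj, H_prev):
    """Concatenate the projected input to the corner of a 3D hidden state.

    x_proj: (M,) projection of the input, i.e. x_t W^x + b^x.
    H_prev: (P1, P2, M) previous hidden state.
    Returns the concatenated state of shape (P1+1, P2+1, M).
    """
    P1, P2, M = H_prev.shape
    H_cat = np.zeros((P1 + 1, P2 + 1, M))
    H_cat[0, 0] = x_proj          # p = [1, 1]: the projected input
    H_cat[1:, 1:] = H_prev        # p_d > 1 for all d: shifted hidden state
    return H_cat                  # all other border entries stay zero

rng = np.random.default_rng(0)
P, M = 3, 4
H_prev = rng.normal(size=(P, P, M))
x_proj = rng.normal(size=M)
H_cat = concat_corner(x_proj, H_prev)
```

With $P_d = P$, the hidden state holds $P^D$ channel vectors, so for a fixed depth the width grows much faster in 3D than in 2D while the kernel, and hence the parameter number, does not depend on $P$.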

Handling Structured Inputs
Until now, we have limited our discussion to the case where the input at each time step is a vector, which is non-structured. However, as structured data (e.g., image time series) also arise in many multimedia modelling tasks (e.g., video segmentation, motion estimation, and video prediction), it is essential to generalise the model to handle structured inputs.
We use $X_t \in \mathbb{R}^{S_1 \times S_2 \times \dots \times S_E \times U}$ to denote the structured input at time step $t$, where $E \in \mathbb{Z}_+$ is the structure dimensionality and $\mathbf{S} = [S_1, S_2, \dots, S_E]$ is the structure size; e.g., when $X_t$ is a 2D image, $\mathbf{S} = [S_1, S_2]$ is the image size (height and width) and $U$ is the image depth (channel). Correspondingly, we have a hidden state $H_t \in \mathbb{R}^{P_1 \times P_2 \times \dots \times P_D \times S_1 \times S_2 \times \dots \times S_E \times M}$. In contrast to (26), we define the sub-tensor $H^{cat}_{t-1,\mathbf{p}} \in \mathbb{R}^{S_1 \times S_2 \times \dots \times S_E \times M}$ located at entry $\mathbf{p}$ of $H^{cat}_{t-1}$ as:
$$H^{cat}_{t-1,\mathbf{p}} = \begin{cases} X_t \circledast \{W^x, b^x\} & \text{if } p_d = 1 \text{ for } d = 1, 2, \dots, D \\ H_{t-1,\mathbf{p}-1} & \text{if } p_d > 1 \text{ for } d = 1, 2, \dots, D \\ 0 & \text{otherwise} \end{cases} \tag{28}$$
where the convolution kernel $\{W^x, b^x\}$ is used for linear projection and is of size $\mathbf{1} \in \mathbb{R}^E$, with $U$ input channels and $M$ output channels.
To update the hidden state tensor, the size of the convolution kernels $W^h$ and $W^c_t(\cdot)$ is extended to $\mathbf{K} = [K_1, \dots, K_D, K_{D+1}, \dots, K_{D+E}]$, where the first $D$ dimensions, $K_{1:D}$, are related to the tensorised size $\mathbf{P}$, and the succeeding $E$ dimensions, $K_{D+1:D+E}$, are related to the structure size $\mathbf{S}$. This also means that $K_{D+1:D+E}$ are free of the constraint (13).
Finally, we generate the output from the sub-tensor $H_{t+L-1,\mathbf{P}} \in \mathbb{R}^{S_1 \times S_2 \times \dots \times S_E \times M}$. Note that for many tasks such as video prediction, the output usually has the same structure (i.e., the same $\mathbf{S}$) as the input. In this case, the output can be generated by:
$$Y_t = \varphi(H_{t+L-1,\mathbf{P}} \circledast \{W^y, b^y\})$$
where $Y_t \in \mathbb{R}^{S_1 \times S_2 \times \dots \times S_E \times V}$ is the structured output and $\{W^y, b^y\}$ is the convolution kernel of size $\mathbf{1} \in \mathbb{R}^E$, with $M$ input channels and $V$ output channels. In addition, it is straightforward to generate a non-structured output $y_t \in \mathbb{R}^V$ from $H_{t+L-1,\mathbf{P}}$, e.g., by using a CNN or a fully-connected network.

Convolutional LSTMs
The Convolutional LSTM (cLSTM) parallelises the computation of the LSTM when the input at each time step is structured (as in Figure 3a), such as an array of vectors [7], a matrix of vectors [39-42], or a tensor of vectors [43,44]. Unlike the cLSTM, the tLSTM focuses on increasing the capacity of the LSTM, where each input can also be non-structured (a single vector), and it has the following advantages: (i) the convolution in the tLSTM is performed across different hidden layers, whose structure can differ from the input structure, integrating information top-down and bottom-up, whereas the convolution in the cLSTM is only performed within each hidden layer, whose structure depends on the input structure, so that the cLSTM falls back to the standard LSTM when each input is a single vector; (ii) by increasing the tensorised size, one can efficiently widen the tLSTM without introducing more parameters, whereas widening the cLSTM by increasing either the kernel size or the kernel channel can significantly increase the parameter number; (iii) by delaying the output, one can deepen the tLSTM with little additional runtime, whereas deepening the cLSTM by increasing the number of hidden layers can significantly increase the runtime; (iv) with the memConv, the tLSTM can capture long-term dependencies from multiple directions, whereas the cLSTM only gates the memory cell along one direction and therefore struggles to do so.
[Figure 3, caption only partially recovered: … [47] with three layers; (e) the quasi-recurrent neural network [48] with three layers and a kernel size of 2, where temporal convolution is utilised to parallelise costly computations.]

Deep LSTMs
The Deep LSTM (dLSTM) improves the sLSTM by further deepening it (as shown in Figure 3b-d). In order to limit the number of parameters and to make training easier, another RNN/LSTM is applied along the depth direction of dLSTMs in [46,47,49,50]. However, the runtime is still multiplied by the depth. Though deep computations are accelerated in [32,51], these works mainly focus on simple architectures, e.g., sLSTMs. Unlike in dLSTMs, deep computations in the tLSTM are performed with little extra runtime, and feedback is enabled by cross-layer convolutions. Furthermore, by utilising higher-dimensional tensors, one can increase the capacity of the tLSTM more efficiently, whereas all the stacked hidden layers in a dLSTM together only compose a 2D tensor, whose dimensionality is fixed.

Other Parallelisation Methods
When full input and target sequences are available for training, the temporal computations of the time series are parallelised in [48,52-56] (for instance, by using temporal convolutions as in Figure 3e). Nevertheless, for online inference, since inputs are presented sequentially, these methods can no longer parallelise temporal computations, which are also blocked by the deep computations of each time step; this makes them poorly suited to real-time applications that require high sampling/output frequencies. By contrast, as the tLSTM performs deep computations through temporal computations, it can accelerate both training and online inference for many tasks. This is human-like: when converting an input signal into action, we simultaneously process newly arrived signals in a non-blocking manner. One should also note that for some tasks (such as autoregressive sequence generation) which take the previous output $y_{t-1}$ as the current input $x_t$, the tLSTM is unable to parallelise the deep computation for online inference, since an additional $L-1$ time steps are required to generate $y_{t-1}$ for each $x_t$.

Experiments
To evaluate our tLSTM, we experiment on seven challenging multimedia data modelling tasks, and are interested in the following configurations:
• sLSTM: We implement the sLSTM [45] and share the parameters across different layers.
To make different configurations comparable, for the sLSTM, we use $L$ and $M$ to represent the layer number and the size of each layer, respectively. Let $K$ be the value of the first $D$ dimensions of the kernel size $\mathbf{K}$; we set $K = 2$ for 1-tLSTM-F and $K = 3$ for the other tLSTM configurations, so that according to (13), we have $L = P$.
To check whether the performance of the tLSTM can be improved without using additional parameters, for each configuration, we use the same number of parameters while increasing the tensorised size. We also inspect how the depth affects the runtime, which is quantified as the average number of milliseconds taken by the forward and backward passes of a single sample over a single RNN time step. Then, we evaluate the tLSTM by comparing it against the state-of-the-art methods. Finally, we analyse the inner working of the tLSTM by visualising its memory cells.
The training objective is to minimise the training loss w.r.t. the parameter $\theta$ (vectorised), i.e.,
$$\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} l\big(f(x^n_{1:t}; \theta), y^n_t\big)$$
where $N$ is the number of training sequences, $T_n$ is the length of the $n$-th training sequence, $f(x^n_{1:t}; \theta)$ is the prediction of the model, and $l(\cdot, \cdot)$ is the loss between the prediction and the target. We define $l(\cdot, \cdot)$ as the Mean Squared Error (MSE) for regression problems (our video prediction tasks), and as the cross entropy for classification problems (our other tasks). In all tasks, the training objective is minimised by Adam [57] with a learning rate of 0.001. Forget gate biases are set to 4 for image classification tasks and to 1 [58] for the others. All models are implemented in Torch7 [59] and accelerated by cuDNN on Tesla K80 GPUs (NVIDIA, Santa Clara, CA, USA). We only apply CN to the output of the tLSTM hidden state, as we have tried different combinations and found this to be the most robust way, consistently improving the performance on all tasks. With CN, the output of the hidden state becomes:
$$H_t = \phi(\mathrm{CN}(C_t; \Gamma, B)) \odot O_t \tag{31}$$

Text Generation
The Hutter Prize Wikipedia dataset [60] is a text file comprising 100 million characters with a vocabulary size of 205, including letters, special symbols, and XML markup. This dataset is modelled at the character level, and the goal is to generate the next character given all previous ones. We evaluate all configurations for the depth $L = 1, 2, 3, 4$ and use 10 M parameters, so that the channel size $M$ is 1120 for sLSTM and 1-tLSTM-F, 901 for the other 1-tLSTMs, and 522 for the 2-tLSTMs. Bits-per-character (BPC) is used to measure performance. As in [33], we split the dataset into 90 M/5 M/5 M for training/validation/test. In each iteration, the model is fed with a mini-batch of 100 subsequences of length 50. During the forward pass, the hidden values at the last time step are preserved to initialise the next iteration. We terminate training after 50 epochs.
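For reference, BPC is the average negative base-2 log-probability that the model assigns to the correct next character. The short sketch below computes it from a list of such probabilities and also converts a mean cross entropy measured in nats; the function names are illustrative.

```python
import math

def bits_per_character(probs_of_target):
    """Mean negative log2-probability assigned to the correct next characters."""
    return -sum(math.log2(p) for p in probs_of_target) / len(probs_of_target)

def nats_to_bpc(cross_entropy_nats):
    """Convert a mean cross entropy measured in nats into bits per character."""
    return cross_entropy_nats / math.log(2)

# A model that guesses uniformly over the 205-symbol vocabulary.
uniform_bpc = bits_per_character([1 / 205] * 50)
```

The uniform guesser gives $\log_2 205$ bits per character, which is the baseline that any useful character-level model has to beat.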
Figure 4 shows the results. With a larger $M$, sLSTM and 1-tLSTM-F perform better than the other models when $L \leq 2$. When $L$ increases, sLSTM and 1-tLSTM-M improve but get stuck when $L \geq 3$, whereas, with the memConv, the performance of the tLSTMs keeps improving and finally surpasses that of sLSTM and 1-tLSTM-M. With $L = 4$, 1-tLSTM-F is outperformed by 1-tLSTM, which is in turn outperformed by 2-tLSTM. Whilst LN benefits 2-tLSTM only when $L \leq 2$, CN consistently benefits 2-tLSTM for all values of $L$.
Note that, in each tLSTM configuration, the runtime is nearly constant and largely unaffected by $L$, while in sLSTM, the runtime is almost proportional to $L$.
To compare with the state-of-the-art methods, we evaluate a larger 2-tLSTM+CN on the test set, where $L = 6$ and $M = 1200$. The results are presented in Table 1. With 50.1 M parameters, our model achieves a BPC of 1.264, and is therefore competitive with the best results [47,50] obtained with a similar number of parameters.

Text Calculation
(i) Addition: The goal of this task is to add two 15-digit integers. The model first reads both integers and then predicts their sum, both sequentially (i.e., one digit per time step). Following [46], we use the symbol '-' to delimit the integers and to pad the input and target sequences.
(ii) Copy: The copy task is to reproduce 20 random symbols presented as a sequence, where 65 different symbols are used. As in the addition task, the symbol '-' is also used as a delimiter, e.g.,
Input:  -7h@P}n$R&+0^(#4?w>5C---------------------
Target: ---------------------7h@P}n$R&+0^(#4?w>5C-
For the addition and copy tasks, we set $M$ to 400 and 100, respectively, and evaluate each configuration for $L = 1, 4, 7, 10$. Symbol prediction accuracy is used to measure the performance. As in [46], for both tasks we randomly generate 5 M training samples and 100 test samples, and set the mini-batch size to 15. Training proceeds for at most one epoch (to simulate the online learning process, we use all training samples only once) and is terminated if 100% test accuracy is achieved.
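Samples for the copy task can be generated along the following lines, mirroring the layout of the input/target example above. The 65-symbol alphabet used here is hypothetical (the actual symbol set is not specified in the text), as are the function and variable names.

```python
import random
import string

# Hypothetical 65-symbol alphabet that excludes the delimiter '-'.
ALPHABET = string.ascii_letters + string.digits + "@#$"

def make_copy_sample(num_symbols=20, rng=random):
    """Return an (input, target) pair of equal length for the copy task."""
    symbols = "".join(rng.choice(ALPHABET) for _ in range(num_symbols))
    padding = "-" * (num_symbols + 1)
    inp = "-" + symbols + padding      # delimiter, symbols, then padding
    tgt = padding + symbols + "-"      # padding, copied symbols, delimiter
    return inp, tgt

inp, tgt = make_copy_sample(rng=random.Random(0))
print("Input :", inp)
print("Target:", tgt)
```

Each sequence has $2 \times 20 + 2 = 42$ time steps, and the model only starts emitting symbols after it has read the whole input.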
Results are shown in Figure 5. In both tasks, the performance of sLSTM and 1-tLSTM-M degrades with larger $L$. By contrast, as $L$ increases, the performance of 1-tLSTM-F keeps improving, and can be further boosted by using feedback, tensors of higher dimensionality, and CN, whilst LN improves the performance only when $L = 1$. Note that correct solutions can be found (when 100% test accuracy is achieved) in both tasks because of their repetitive nature. From the experiments, we find that in the addition task, 2-tLSTM+CN with $L = 7$ performs best and solves the task using only 298 K training examples, whilst in the copy task, 2-tLSTM+CN with $L = 10$ outperforms the other configurations and copies perfectly using only 54 K training examples. Moreover, unlike that of sLSTM, the runtime of all tLSTMs is largely independent of $L$.
On both tasks, the best performing configurations are further compared with the state-of-the-art methods. Table 2 reports the results. Our model solves both the addition and the copy task significantly faster (with fewer training examples) than the others, setting a new state of the art.

Image Classification
The MNIST dataset [31] comprises 70,000 handwritten digit images of size 28×28, and is divided into 50,000/10,000/10,000 for training/validation/test. For this dataset, there are two tasks: (i) Sequential MNIST: In this task, the model first reads the pixels sequentially in scanline order, and then outputs the class of the digit contained in the image [62]. It is a time series task of 784 time steps, where the output is generated at the last time step, so the model is required to capture very long-term temporal dependencies.
(ii) Sequential Permuted MNIST: To make the problem even harder, we generate a permuted MNIST (pMNIST) [63] by permuting the original image pixels with a fixed random order, so that long-term dependencies can also exist between neighbouring pixels.
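The two input orderings can be sketched as follows: a 28×28 image is flattened in scanline (row-major) order, and for pMNIST one fixed random permutation is applied to every image. The seed, the function names, and the stand-in image are illustrative assumptions.

```python
import numpy as np

def make_permutation(num_pixels=784, seed=0):
    """Draw one fixed random pixel order, reused for every image."""
    return np.random.default_rng(seed).permutation(num_pixels)

def to_pixel_sequence(image, perm=None):
    """Flatten an image in scanline order and optionally permute the pixels."""
    seq = image.reshape(-1)                  # row-major, i.e. scanline order
    return seq if perm is None else seq[perm]

perm = make_permutation()
image = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)   # stand-in image
seq = to_pixel_sequence(image)               # sequential MNIST input
pseq = to_pixel_sequence(image, perm)        # sequential pMNIST input
```

Because the same `perm` is used for all images, the task remains learnable, but pixels that were adjacent in the image can end up hundreds of time steps apart.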
In both tasks, we evaluate all configurations with $M = 100$ and $L = 1, 3, 5$. We use the classification accuracy to measure the model performance. We set the mini-batch size to 50 and use early stopping for training. The training loss is calculated at the last time step.
Figure 6 shows the results. Increasing the depth no longer benefits sLSTM and 1-tLSTM-M when $L = 5$, while the performance of 1-tLSTM can be boosted by a larger depth and by tensorisation. However, the performance of 1-tLSTM does not seem to be affected by removing the feedback connections. In addition, CN always improves 2-tLSTM and outperforms LN when $L \geq 3$. With validation accuracies of 99.1% on MNIST and 95.6% on pMNIST, 2-tLSTM+CN with $L = 5$ outperforms all other configurations in both tasks. In tLSTMs, the runtime is barely affected by $L$, and when $L = 5$, all tLSTMs run faster than sLSTM.
As presented in Table 3, the best performing configurations are compared against the state-of-the-art methods. On sequential MNIST, 2-tLSTM+CN with $L = 3$ achieves 99.2% test accuracy, which equals the state-of-the-art result produced by the Dilated GRU [56]. On sequential pMNIST, 2-tLSTM+CN with $L = 5$ achieves 95.7% test accuracy, approaching the state-of-the-art result of 96.7%, which is obtained by the Dilated CNN [54] as reported in [56].

Video Prediction
The task of video prediction aims at predicting the future frames of a video given the historical frames. It has a variety of applications such as environment simulation, dataset augmentation, and many computer vision tasks. The main challenge is that the model must capture both the spatial and the temporal relationships among the data well. We apply our model to two datasets: (i) KTH [67]: The dataset consists of 600 real videos with 25 subjects performing six actions (walking, running, jogging, hand-clapping, hand-waving, and boxing). It has been split into a training set (subjects 1-16) and a test set (subjects 17-25), resulting in 383 and 216 sequences, respectively. We resize all frames to 128×128. (ii) UCF101 [68]: The dataset consists of 13,320 real videos of resolution 320×240 with 101 human actions that can be split into five types (sports, playing musical instruments, human-human interaction, body-motion only, and human-object interaction). It is currently the most challenging dataset of actions. Following [69], we train our models on the Sports-1M [70] dataset and test them on UCF101.
On both tasks, we evaluate all configurations with $L = 1, 3, 5$. To process the structured inputs (i.e., video frames), we modify the original sLSTM [45] by replacing each LSTM layer with a Convolutional LSTM [39], where the convolution kernel size is set to [5, 5]. We also set the last two dimensions (relevant to the image structure) of the convolution kernel size $\mathbf{K}$ to 5 for the tLSTMs. $M$ is set to 100 for KTH and 200 for UCF101. The model performance is measured by three common metrics, namely MSE, Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM) [71], where SSIM ranges in [−1, 1] (larger is better). We set the mini-batch size to 16 and employ early stopping for training. All models are trained by observing 10 frames and predicting the next 10 frames.
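Two of the three metrics are simple enough to sketch directly. The code below uses the common definition of PSNR, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, and assumes that pixel values are scaled to [0, 1]; the scaling, the function names, and the synthetic frames are assumptions of this example, and SSIM is omitted because it requires a windowed computation.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between a predicted and a ground-truth frame."""
    return float(np.mean((pred - target) ** 2))

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; larger is better."""
    err = mse(pred, target)
    if err == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / err))

# Synthetic 128x128 frames: the prediction is off by 0.1 at every pixel.
target = np.zeros((128, 128))
pred = np.full((128, 128), 0.1)
```

For this example the MSE is 0.01, so the PSNR is 20 dB; halving the per-pixel error would raise it by about 6 dB.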
Figure 7 shows the quantitative results.When L increases, sLSTM and 1-tLSTM-M improve their performances but get stuck at L = 5, while with the memConv, the performances of tLSTMs improve and finally exceed both sLSTM and 1-tLSTM-M.The effects of feedback and tensorisation become significant when L is large.Similar to the finding in [38] that LN is not suitable for normalising the convolution layer for images, the performance of 2-tLSTM+LN is even worse than 2-tLSTM.However, CN improves 2-tLSTM with different L. Unlike sLSTM where the runtime increases linearly w.r.t.L, tLSTM can keep its runtime largely unchanged when increasing L.
The best performing configuration is compared with the state-of-the-art methods (whose source code is publicly available) on both datasets (see Table 4). 2-tLSTM+CN with L = 5 outperforms all existing models on KTH w.r.t. all metrics, and on UCF101 w.r.t. MSE and SSIM. Sampled qualitative results produced by 2-tLSTM+CN are shown in Figure 8.

Analysis
It can be seen from the experiments that one can boost the performance of tLSTMs by enlarging the tensorised size or increasing the model depth, while requiring almost no extra parameters or runtime. The memConv is indispensable for maintaining the performance improvement as the network gets wider and deeper. In addition, feedback connections are useful for tasks with sequential output, and tensorisation or CN can further strengthen the tLSTM.
To inspect the inner workings of our tLSTM, we visualise the values of the memory cells to show the information routing. On each task, we run the best performing tLSTM on a random sample (we do not consider the video prediction tasks here, since the memory cells of the 2-tLSTM are 5D tensors, which are hard to visualise). At each time step, we record the memory cell's channel mean (computed by averaging along the channel dimension; for the 2-tLSTM, it has a size of P×P), and visualise its diagonal values from location p_in = [1, 1] (close to the input) to p_out = [P, P] (close to the output).
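The recording step described above can be sketched as follows for a single time step of a 2-tLSTM. This is a minimal illustration and not our visualisation code: the memory cell is assumed to be stored as a nested list indexed as `cell[p1][p2][m]` (a P×P×M layout), and the function names are hypothetical.

```python
def channel_mean(cell):
    """Average a P x P x M memory cell over its channel axis, giving a P x P grid."""
    return [[sum(channels) / len(channels) for channels in row] for row in cell]

def diagonal(grid):
    """Values from location [1, 1] (near the input) to [P, P] (near the output)."""
    return [grid[p][p] for p in range(len(grid))]

# Toy memory cell with P = 2 and M = 2 (hypothetical values)
cell = [
    [[1.0, 3.0], [0.0, 0.0]],
    [[5.0, 5.0], [2.0, 4.0]],
]
trace = diagonal(channel_mean(cell))  # one value per diagonal location
print(trace)
```

Under these assumptions, collecting one such `trace` per time step and stacking them along the time axis would give a T×P map in which each row is a time step and each column a diagonal location.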
As shown in Figure 9, the visualisation result reveals different behaviors of tLSTM when handling different tasks:

• Text Generation: If the next character is largely determined by the current input, the input content can be preserved with less modification when it arrives at the output location, and vice versa.
• Addition: Two integers are gradually compressed into the memory and then interact with each other, generating their summation.
• Copy: The model acts as a shift register, continuing to move the input symbols to their output locations.
• Seq. MNIST: The model seems more sensitive to pixel value changes (which represent contours, or the digit topology); it gradually accumulates evidence to generate the final output.
• Seq. pMNIST: The model seems more sensitive to high value pixels (which come from the digit); our conjecture is that the permutation has destroyed the digit topology, and thereby made each high value pixel potentially important.

In these tasks, there are also some common phenomena:
• At each time step, different locations of the tensor possess markedly different values, which implies that a tensor of a larger size could encode more content, requiring less effort for compressing.
• The value becomes more and more distinct from the input to output and is shifted along the time axis, which reveals that the model indeed simultaneously performs the deep and temporal computations, with the memory cell carrying the long-term dependency.

Conclusions
In this paper, we have aimed to deal with multimedia modelling tasks. We have introduced the tLSTM, where tensors are employed to share parameters and temporal computations are utilised to perform deep computations. The main advantage of our tLSTM over other popular methods is that its capacity can be increased with almost no extra parameters and runtime. Another important advantage of the tLSTM is that it can handle a variety of challenging multimedia modelling tasks well, as shown in our experiments.
For future work, we would like to: (i) further investigate the effect of higher-dimensional tensors, e.g., by trying 3- and 4-tLSTMs; (ii) try increasing the transition depth for tLSTM hidden states (similar to [47]); and (iii) apply tLSTMs to more multimedia modelling tasks such as machine translation, image generation, and video segmentation.

Figure 1. Illustration of the evolution from sRNN to tLSTM. (a) sRNN with three layers; (b) tRNN with no feedback connection (-F), which can be obtained by skewing the sRNN shown in (a); (c) the standard tRNN; (d) tLSTM with no memConv (-M); (e) the standard tLSTM. For each model, white circles in columns 1 to 4 (from left to right) represent hidden states at times (t−1) to (t+2), respectively. Blue regions represent the receptive fields of the output y_t. Note that, in (b) and (e), we have delayed the outputs by L−1 = 2 time steps, with a depth L = 3.

Figure 2. Illustration of how to generate the memConv kernels for 2D (a) and 3D (b) tensors.

Figure 3. Examples of the related models. (a) The cLSTM [7] with one layer, where the input at each time step is an array of vectors; (b) the sLSTM [45] with three layers; (c) the Grid LSTM [46] with three layers; (d) the recurrent highway network (RHN) [47] with three layers; (e) the quasi-recurrent neural network [48] with three layers and a kernel size of 2, where temporal convolution is utilised to parallelise costly computations.

Figure 4. Performance and runtime on Wikipedia.

Figure 5. Performance and runtime on text calculation tasks including addition (left) and copy (right).

Figure 8. Sampled qualitative results produced by 2-tLSTM+CN on KTH (sequences 1 to 3) and UCF101 (sequences 4 to 6). For each sequence, the first row shows the last five input frames (left) and the next 10 target frames (right), and the second row shows the next 10 predictions. All frames are shown with an aspect ratio of 4:3.

Figure A1. Illustration of calculating the constraint of L, P, and K. Each column is a concatenated hidden state tensor with tensorised size P+1 = 4 and channel size M. The volume of the output receptive field (blue region) is determined by the kernel radius K_r. The output y_t for the current time step t is delayed by L−1 = 2 time steps.

Table 1. Test BPCs for Wikipedia text generation.

Table 2. Test accuracies for addition/copy.