Multimodal Tucker Decomposition for Gated RBM Inference

Gated networks are networks that contain gating connections, in which the outputs of at least two neurons are multiplied. The basic idea of a gated restricted Boltzmann machine (RBM) model is to use the binary hidden units to learn the conditional distribution of one image (the output) given another image (the input). This allows the hidden units of a gated RBM to model the transformations between two successive images. Inference in the model consists of extracting the transformations given a pair of images. However, a fully connected multiplicative network creates cubically many parameters, forming a three-dimensional interaction tensor that requires a lot of memory and computation for inference and training. In this paper, we parameterize the bilinear interactions in the gated RBM through a multimodal tensor-based Tucker decomposition. Tucker decomposition decomposes a tensor into a set of matrices and one (usually smaller) core tensor. The parameterization through Tucker decomposition reduces the number of model parameters, reduces the computational cost of the learning process, and effectively strengthens structured feature learning. When trained on affine transformations of still images, we show how a completely unsupervised network learns explicit encodings of image transformations.


Introduction
Feature engineering is the process of transforming raw data into a suitable representation or feature vector that can be used to train a machine learning model for a prediction problem. For decades, the traditional approach to feature engineering was to manually build a feature extractor, which required careful engineering and considerable domain expertise. Manual feature engineering required writing code that was problem-specific and that had to be adjusted for each new dataset. Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [1] (p. 436). Deep learning improves the process of feature engineering by automatically extracting useful and interpretable features. It also eliminates the need for domain expertise and hard core feature extraction by learning high-level features from the data in a hierarchical manner. The main building blocks in the deep learning literature are restricted Boltzmann machines [2,3], autoencoders [4,5], convolutional neural networks [6,7], and recurrent neural networks [8].
Most of these architectures are used to learn a relationship between a single input source and the corresponding output. However, there are many domains where the representation to be learned is the correspondence between more than one source and one output [9]. For instance, many tasks in vision carry the relevant information in the encoding of the relationship between observations, not the content of a single observation.
The above deep learning building blocks can be extended with gated connections, which allow them to learn relationships between at least two sources of input and at least one output. A defining feature of gated networks is that they contain gating connections. Unlike other networks, whose layer-to-layer connections are linear, gated networks introduce higher-order interactions: the connection between two neurons x and y is modulated by the activity of a third neuron h. Figure 1 illustrates two different approaches for the connection relationship between three neurons: to control the flow of information in the network or to model multiplicative interactions between several inputs. In the first type of connection, the neuron h is used as a switch or a gate that either blocks or passes the flow of information between x and y. In the second type of connection, the connection implements a multiplicative relationship between x and h, whose values are multiplied before being projected to the output y by the synaptic connection. In the multiplicative interaction, we can say that the neuron h modulates the signal between x and y. Despite the growing interest, the literature about gated networks is still sparse [9] (p. 2). The focus of this paper is the specific family of neural networks implementing a multiplicative gating relationship that are built on an RBM architecture. This concept of a gated restricted Boltzmann machine was first introduced in [10]. The basic idea of the gated model is to use the binary hidden units to learn the conditional distribution of one image (the output) given another image (the input). In [11], the authors revisit the problem and present a factorized alternative to the gated RBM. A gated RBM can also be considered a higher-order Boltzmann machine. As cited in [10] (p. 1474), Boltzmann machines that contain multiplicative interactions between more than two units are known in general as higher-order Boltzmann machines [12].
The remainder of this paper is organized as follows: in Section 2, the RBM model and its popular training method, Contrastive Divergence (CD), are presented. In Section 3, the standard gated RBM model is presented and a mechanism to reduce the number of its weights by projecting onto factor layers is discussed. Next, in Section 4, we present a short overview of tensors and the factorization known as Tucker decomposition. In Section 5, we introduce a multimodal tensor-based Tucker decomposition for the three-way parameter tensor in the gated RBM. In this section, we also show that by using Tucker decomposition, we can use fewer than the cubically many parameters implied by the three-way weight tensor, and we introduce a Contrastive Divergence-based training procedure for the gated RBM, which reduces the number of model parameters and efficiently parameterizes its bilinear interactions. Experimental results and their corresponding discussion are reviewed in Section 6. Finally, conclusions are presented in Section 7.

Restricted Boltzmann Machines
A restricted Boltzmann machine is a type of graphical model in which the nodes $x = \{v, h\}$ form a symmetric bipartite graph with binary observed variables $v \in \{0, 1\}^n$ (visible nodes) and binary latent variables $h \in \{0, 1\}^m$ (hidden nodes). Each visible unit (node) is connected to each hidden unit, but there are no visible-to-visible or hidden-to-hidden connections. Importantly, RBMs are able to model the joint distribution of visible and hidden units, enabling them to generate samples similar to the training data on the visible layer. Models of this type are called generative models. A classic RBM model is illustrated in Figure 2. An RBM is governed by an energy function; the energy of a joint configuration $(v, h)$ of the visible layer and the hidden layer is given by

$$E(v, h) = -\sum_{i=1}^{n} \sum_{j=1}^{m} W_{ij} v_i h_j - \sum_{i=1}^{n} b_i v_i - \sum_{j=1}^{m} c_j h_j$$

where $\theta = \{W, b, c\}$ are the parameters of the model. $W_{ij}$ is the connection weight between visible unit $i$ and hidden unit $j$, and $b_i$ and $c_j$ are the biases of the visible layer and the hidden layer, respectively. $E(v, h)$ is called the energy of the state $(v, h)$. The joint probability distribution under the model is given by the Boltzmann distribution:

$$p(v, h) = \frac{1}{Z(\theta)} \exp\left(-\frac{E(v, h)}{\tau}\right)$$

where $Z(\theta)$ is a normalizing constant and $\tau$ is the thermodynamic temperature (often considered as 1). $Z(\theta)$ is called the partition function and is defined by summing over all possible visible and hidden configurations. Therefore, it is extremely hard to compute when the number of units is large. The partition function is represented as follows:

$$Z(\theta) = \sum_{v} \sum_{h} \exp\left(-\frac{E(v, h)}{\tau}\right)$$

Since there are no connections between two variables of the same layer, it is possible to derive an expression for $p(v_k = 1 \mid h)$; this is the probability of a particular visible unit being on given a hidden configuration (taking $\tau = 1$):

$$p(v_k = 1 \mid h) = \sigma\Big(b_k + \sum_{j} W_{kj} h_j\Big) \tag{2}$$

This is also true for a particular hidden unit given a visible configuration:

$$p(h_k = 1 \mid v) = \sigma\Big(c_k + \sum_{i} W_{ik} v_i\Big) \tag{3}$$

where $\sigma(z) = 1/(1 + \exp(-z))$. This leads to block Gibbs sampling dynamics, used universally for sampling from RBMs.
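The block Gibbs dynamics implied by the two conditionals can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this work: the layer sizes, the random initialization, and the helper names `sample_hidden` and `sample_visible` are assumptions introduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_hidden(v, W, c, rng):
    """Return p(h = 1 | v) and a binary sample drawn from it."""
    p = sigmoid(c + v @ W)            # all hidden units at once
    return p, (rng.random(p.shape) < p).astype(float)

def sample_visible(h, W, b, rng):
    """Return p(v = 1 | h) and a binary sample drawn from it."""
    p = sigmoid(b + W @ h)            # all visible units at once
    return p, (rng.random(p.shape) < p).astype(float)

# Hypothetical toy sizes and parameters, for illustration only.
rng = np.random.default_rng(0)
n, m = 6, 4
W = 0.1 * rng.standard_normal((n, m))
b, c = np.zeros(n), np.zeros(m)

v = (rng.random(n) < 0.5).astype(float)
for _ in range(3):                    # three block Gibbs sweeps
    p_h, h = sample_hidden(v, W, c, rng)
    p_v, v = sample_visible(h, W, b, rng)
```

Because each layer is conditionally independent given the other, every unit of a layer is sampled in a single vectorized step.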

RBM Training
Carreira-Perpinan and Hinton [13] showed that the derivative of the log-likelihood of the data under the RBM with respect to its parameters is:

$$\frac{\partial \log p(v^{(t)} \mid \theta)}{\partial \theta} = -\left\langle \frac{\partial E(v, h)}{\partial \theta} \right\rangle_{\text{data}} + \left\langle \frac{\partial E(v, h)}{\partial \theta} \right\rangle_{\text{model}} \tag{4}$$

where $\langle \cdot \rangle_{\text{data}}$ denotes the expectation over the data, or in other words the distribution $p(h \mid v^{(t)}, \theta)$. In the same way, $\langle \cdot \rangle_{\text{model}}$ denotes the expectation over the model distribution $p(v, h \mid \theta)$. However, directly calculating the sums that run over all values of $v$ and $h$ in the second term in (4) has a computational complexity that is, in general, exponential in the number of variables. The $\langle \cdot \rangle_{\text{model}}$ expectation can be approximated with samples from the model distribution. These samples can be obtained via Gibbs sampling, iteratively sampling all units in one layer at once given the other layer using (2) and (3) alternately. However, this requires running the Markov chain for an infinite time to ensure convergence to a stationary state, making it an infeasible solution.
Obtaining an unbiased sample of $\langle \cdot \rangle_{\text{model}}$ is extremely difficult. Hinton [3] approximates the second term in the true gradient in (4) by using an approximation of the derivative called Contrastive Divergence: the goal of CD is to replace the average $\langle \cdot \rangle_{\text{model}}$ with samples $\langle \cdot \rangle_{k}$ obtained after running $k$ steps of Gibbs sampling starting from each data sample. This is illustrated in Figure 3. A typical value used in the literature is $k = 1$, and this way of updating the parameters has become a standard way of training RBMs. Although it has proven to work well in practice, CD does not yield the best approximation of the log-likelihood gradient [13,14]. There has been much research dedicated to better understanding this approach and the reasoning behind its success [13][14][15], leading to many variations being proposed from the perspective of improving the Markov chain Monte Carlo approximation to the gradient, namely, Persistent CD [16], Fast Persistent CD [17], and Parallel Tempering [18].
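As an illustration of the CD-k idea described above, the following NumPy sketch performs one parameter update on a mini-batch. It is a simplified example under stated assumptions: the function name `cd_k_update`, the learning rate, and the use of hidden probabilities (rather than samples) in the sufficient statistics are choices made here, not details taken from the references.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(v0, W, b, c, k=1, lr=0.1, rng=None):
    """One CD-k ascent step on the log-likelihood for a batch v0 of shape (N, n)."""
    rng = np.random.default_rng() if rng is None else rng
    ph0 = sigmoid(c + v0 @ W)                      # data-driven hidden probabilities
    h = (rng.random(ph0.shape) < ph0).astype(float)
    vk, phk = v0, ph0
    for _ in range(k):                             # k steps of block Gibbs sampling
        pv = sigmoid(b + h @ W.T)
        vk = (rng.random(pv.shape) < pv).astype(float)
        phk = sigmoid(c + vk @ W)
        h = (rng.random(phk.shape) < phk).astype(float)
    N = v0.shape[0]
    dW = (v0.T @ ph0 - vk.T @ phk) / N             # <v h>_data - <v h>_k
    db = (v0 - vk).mean(axis=0)
    dc = (ph0 - phk).mean(axis=0)
    return W + lr * dW, b + lr * db, c + lr * dc

# Hypothetical toy data: 8 random binary vectors with 6 visible units.
rng = np.random.default_rng(0)
v0 = (rng.random((8, 6)) < 0.5).astype(float)
W = 0.01 * rng.standard_normal((6, 4))
b, c = np.zeros(6), np.zeros(4)
W1, b1, c1 = cd_k_update(v0, W, b, c, k=1, lr=0.1, rng=rng)
```

The chain is restarted from the data at every update; Persistent CD would instead carry `vk` over between calls.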

Gated RBM
Gated RBMs are a natural extension of RBMs, in which the gating idea is applied via the units of a hidden layer connecting/gating the neurons of two other layers (input and output). As in the RBM model, the gated RBM is governed by an energy function. Memisevic and Hinton [10] propose using the following three-way energy function that captures all possible correlations among the components of the x (input), y (output), and h (hidden) layers:

$$E(y, h; x) = -\sum_{ijk} W_{ijk}\, x_i y_j h_k \tag{5}$$

where $i$, $j$ and $k$ index the units in the input, output and hidden layers, respectively; $x_i$ is the binary state of input pixel $i$; $y_j$ is the binary state of output pixel $j$; and $h_k$ is the binary state of hidden unit $k$. Figure 4 shows a fully connected gated RBM. The components $W_{ijk}$ of its three-way interaction tensor connect units $x_i$, $y_j$ and $h_k$ and learn to weight the importance of the possible correlations given some training data. This type of multiplicative interaction among the input, output, and hidden units leads to a type of higher-order Boltzmann machine that retains the computational benefits of RBMs, such as being amenable to contrastive divergence training and allowing for efficient inference schemes that use alternating Gibbs sampling [11] (p. 1474).
In practice, to be able to model affine and not just linear dependencies, it is useful to add biases to the output and hidden units, which turns (5) into:

$$E(y, h; x) = -\sum_{ijk} W_{ijk}\, x_i y_j h_k - \sum_{k} W^{h}_{k} h_k - \sum_{j} W^{y}_{j} y_j \tag{6}$$

where the terms $\sum_k W^{h}_{k} h_k$ and $\sum_j W^{y}_{j} y_j$ are bias terms used to model the base rates of activity of the hidden and output units, respectively. In general, a higher-order Boltzmann machine can also contain bias terms for the input units, but following [11], we do not use these. The negative energy $-E(y, h; x)$ captures the compatibility between the input, output and hidden units. As in the RBM model, we can use this energy function to define the joint distribution $p(y, h \mid x)$ over output and hidden variables by exponentiating and normalizing:

$$p(y, h \mid x) = \frac{1}{Z(x)} \exp\big(-E(y, h; x)\big) \tag{7}$$

where

$$Z(x) = \sum_{y, h} \exp\big(-E(y, h; x)\big) \tag{8}$$

is a normalizing constant, which depends on the input image x. To obtain the distribution over output images, given the input, we marginalize and get:

$$p(y \mid x) = \sum_{h} p(y, h \mid x) \tag{9}$$

This marginalization over the hidden units leads to what is known in the literature as the free energy. Note that neither $p(y \mid x)$ nor $Z(x)$ can be computed exactly, since both contain sums over the exponentially many configurations of the hidden units (and, for $Z(x)$, of the output units as well). However, we do not actually need to compute any of these quantities to perform either inference or learning, as we shall see in the next section.
It is important to note that the normalization step in (8) is performed over h and y; thus, it defines the conditional distribution p(y, h | x) rather than the joint p(y, h, x). This is done deliberately to free the model from many of the independence assumptions that a fully generative model would need to make, hence simplifying inference and learning [10].
Inference then consists of guessing the transformation, or equivalently its encoding h, from a given pair of observed images x and y. Since the energy function does not contain interactions between any pairs of output units or pairs of hidden units, it is possible to derive a closed-form expression for $p(h_k = 1 \mid x, y)$; this is the probability of a particular hidden unit being on when an input-output image pair is given:

$$p(h_k = 1 \mid x, y) = \sigma\Big(\sum_{ij} W_{ijk}\, x_i y_j + W^{h}_{k}\Big) \tag{10}$$

This is also true for the output units when input and hidden units are given:

$$p(y_j = 1 \mid x, h) = \sigma\Big(\sum_{ik} W_{ijk}\, x_i h_k + W^{y}_{j}\Big) \tag{11}$$

Note that in practice, the function $\sigma(\cdot)$ can be replaced by some other non-linear activation function. Regardless of the activation function, these models are called bilinear because, if one input is held fixed, the pre-activation of the output is linear in the other input.
Consider now the task of predicting the hidden layer $\tilde{h}$ given the input x and output y units. In such a multiplicative network, this consists of computing all the values $\tilde{h}_k$ of $\tilde{h}$ using (10):

$$\tilde{h}_k = \sigma\Big(\sum_{ij} W_{ijk}\, x_i y_j + W^{h}_{k}\Big) \tag{12}$$

Alternatively, one may compute $\tilde{y}$ given the input x and hidden h units using (11):

$$\tilde{y}_j = \sigma\Big(\sum_{ik} W_{ijk}\, x_i h_k + W^{y}_{j}\Big) \tag{13}$$

Memisevic and Hinton [10] point out that this type of three-way model can be interpreted as a mixture of experts. Note from (5) that, in the way the energy is defined, the importance that each hidden unit $h_k$ attributes to the correlatedness (or anti-correlatedness) of a particular pair $x_i$, $y_j$ is determined by $W_{ijk}$.
To train the probabilistic model, we can use the same principle as Contrastive Divergence in the RBM model: we maximize the average conditional log-likelihood $L = \frac{1}{N} \sum_{\alpha} \log p(y^{\alpha} \mid x^{\alpha})$ for a set of training pairs $\{(x^{\alpha}, y^{\alpha})\}$. The derivative of the (negative) log probability with respect to the weight parameter $W_{ijk}$ is given by the difference of two expectations:

$$-\frac{\partial \log p(y \mid x)}{\partial W_{ijk}} = \left\langle \frac{\partial E(y, h; x)}{\partial W_{ijk}} \right\rangle_{h} - \left\langle \frac{\partial E(y, h; x)}{\partial W_{ijk}} \right\rangle_{h, y}, \qquad \frac{\partial E(y, h; x)}{\partial W_{ijk}} = -x_i y_j h_k \tag{14}$$

where $\langle \cdot \rangle_{z}$ denotes the expectation with regard to variable z. Note that the expectation in the first term in (14) is over the posterior distribution over hidden units, and it can be computed efficiently using (12). The expectation in the second term in (14) is over all possible output/hidden instantiations and is intractable. However, because of the conditional independences of h given x and y, and of y given x and h, we can easily sample from the conditional distributions $p(h \mid x, y)$ and $p(y \mid x, h)$. Using Gibbs sampling with Equations (12) and (13), respectively, for the hidden and output layer, we can approximate the intractable term.
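The two inference directions of the unfactored gated RBM amount to tensor contractions followed by a sigmoid. The following NumPy sketch illustrates them with `np.einsum`; the toy sizes, random parameters, and function names are assumptions made for this example only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def infer_hidden(x, y, W, w_h):
    """p(h_k = 1 | x, y): contract W over the input (i) and output (j) modes."""
    return sigmoid(np.einsum("ijk,i,j->k", W, x, y) + w_h)

def infer_output(x, h, W, w_y):
    """p(y_j = 1 | x, h): contract W over the input (i) and hidden (k) modes."""
    return sigmoid(np.einsum("ijk,i,k->j", W, x, h) + w_y)

# Hypothetical toy model: 5 input, 5 output and 3 hidden units.
rng = np.random.default_rng(0)
n_x, n_y, n_h = 5, 5, 3
W = 0.1 * rng.standard_normal((n_x, n_y, n_h))
w_h, w_y = np.zeros(n_h), np.zeros(n_y)
x = (rng.random(n_x) < 0.5).astype(float)
y = (rng.random(n_y) < 0.5).astype(float)

h_tilde = infer_hidden(x, y, W, w_h)                 # encoding of the transformation
h = (rng.random(n_h) < h_tilde).astype(float)        # binary sample of the hidden units
y_tilde = infer_output(x, h, W, w_y)                 # prediction of the output image
```

Forming the full tensor `W` costs `n_x * n_y * n_h` parameters, which is what motivates the factorizations discussed in the rest of the paper.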

Factorized Gated RBM
Memisevic and Hinton [11] propose a way of reducing the number of weights that consists of projecting the x, y, and h layers onto smaller layers, noted, respectively, as f x , f y , and f h before performing the product between these smaller layers. Given their multiplicative role, these layers are called factor layers.
The three-way tensor $W_{ijk}$ is constrained to use these projections, with three factor layers $f^x$, $f^y$, and $f^h$ of the same size $n_f$, as illustrated in Figure 5. Moreover, the weights $W_{ijk}$ are restricted to follow a specific form:

$$W_{ijk} = \sum_{f=1}^{n_f} W^{x}_{if}\, W^{y}_{jf}\, W^{h}_{kf} \tag{15}$$

With this constraint, the matrices $W^x$, $W^y$ and $W^h$ are of respective size $n_x \times n_f$, $n_y \times n_f$ and $n_h \times n_f$; thus the total number of weights is just $n_f \times (n_x + n_y + n_h)$, which is quadratic instead of cubic in the size of the input or factors.
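A short NumPy check can make the effect of this constraint concrete. In the sketch below (toy sizes and random matrices are assumptions for illustration), the hidden pre-activation computed from the three factor matrices matches the one computed from the explicitly assembled tensor, while the factored form stores far fewer weights.

```python
import numpy as np

# Hypothetical toy sizes for the input, output, hidden and factor layers.
rng = np.random.default_rng(0)
n_x, n_y, n_h, n_f = 5, 6, 4, 3
Wx = rng.standard_normal((n_x, n_f))
Wy = rng.standard_normal((n_y, n_f))
Wh = rng.standard_normal((n_h, n_f))
x = rng.standard_normal(n_x)
y = rng.standard_normal(n_y)

# Assemble the full tensor W_ijk = sum_f Wx_if Wy_jf Wh_kf (only for checking).
W = np.einsum("if,jf,kf->ijk", Wx, Wy, Wh)
pre_full = np.einsum("ijk,i,j->k", W, x, y)

# Same quantity without ever forming W: project, multiply factor-wise, project back.
pre_factored = Wh @ ((Wx.T @ x) * (Wy.T @ y))

n_params_full = n_x * n_y * n_h
n_params_factored = n_f * (n_x + n_y + n_h)
```

The element-wise product of the two projected vectors is where the multiplicative (gating) interaction takes place.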

Tucker Decomposition
Since tensors, specifically three-way tensors, and the corresponding Tucker decomposition are at the core of this study, a brief overview of the subject is presented. First introduced by Tucker [19] and refined in subsequent articles by Levin [20] and Tucker et al. [21], the Tucker decomposition is a form of higher-order Principal Component Analysis. Tucker decomposition factorizes a tensor into a (usually smaller) core tensor and a set of factor matrices, one along each mode. In the three-way case, where $\mathcal{W} \in \mathbb{R}^{I \times J \times K}$, we have

$$\mathcal{W} = \mathcal{G} \times_1 A \times_2 B \times_3 C, \qquad W_{ijk} = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} g_{pqr}\, a_{ip}\, b_{jq}\, c_{kr} \tag{16}$$

where the operator $\times_n$ denotes the mode-n multiplication of a tensor by a matrix. $A \in \mathbb{R}^{I \times P}$, $B \in \mathbb{R}^{J \times Q}$, and $C \in \mathbb{R}^{K \times R}$ are known as the factor matrices and can be thought of as the principal components in each mode. The tensor $\mathcal{G} \in \mathbb{R}^{P \times Q \times R}$ is called the core tensor, and its entries show the level of interaction between the different components. The Tucker decomposition of $\mathcal{W}$ is usually summarized as:

$$\mathcal{W} = [\![\mathcal{G};\, A, B, C]\!] \tag{17}$$

A comprehensive discussion on Tucker decomposition and tensor analysis is available in Kolda [22]. If $\mathcal{G}$ is the same size as $\mathcal{W}$, the Tucker decomposition is simply a change of basis. More often, we are interested in using a change of basis to compress $\mathcal{W}$. If P, Q, R are smaller than I, J, K, the core tensor $\mathcal{G}$ can be thought of as a compressed version of $\mathcal{W}$.
For some computations presented in this document, it is important to be able to transform the indices of a tensor so that it can be represented as a matrix and vice versa. Matricization, also known as unfolding or flattening, is the process of reordering the elements of a tensor (N-way array) into a matrix [23]. For instance, a 3 × 4 × 5 tensor can be rearranged as a 12 × 5 matrix or a 3 × 20 matrix, and so on.
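The matricized forms (one per mode) of (16) are:

$$W_{(1)} = A\, G_{(1)} (C \otimes B)^{\top}, \qquad W_{(2)} = B\, G_{(2)} (C \otimes A)^{\top}, \qquad W_{(3)} = C\, G_{(3)} (B \otimes A)^{\top} \tag{18}$$

where $W_{(n)}$ and $G_{(n)}$ denote the mode-n matricizations of $\mathcal{W}$ and $\mathcal{G}$, and $\otimes$ is the Kronecker product.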
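The relation between a Tucker-format tensor and its mode-n unfoldings can be verified numerically. The sketch below assumes the unfolding convention of Kolda [22], in which lower-numbered modes vary fastest along the columns (Fortran ordering in NumPy); the helper names `unfold` and `tucker_to_tensor` and the toy sizes are introduced here for illustration.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n matricization: rows index the chosen mode, earlier modes vary fastest."""
    return np.reshape(np.moveaxis(T, mode, 0), (T.shape[mode], -1), order="F")

def tucker_to_tensor(G, A, B, C):
    """W_ijk = sum_pqr g_pqr a_ip b_jq c_kr, i.e. W = G x_1 A x_2 B x_3 C."""
    return np.einsum("pqr,ip,jq,kr->ijk", G, A, B, C)

# Hypothetical toy sizes: a 3 x 4 x 5 tensor with a 2 x 3 x 2 core.
rng = np.random.default_rng(0)
I, J, K, P, Q, R = 3, 4, 5, 2, 3, 2
G = rng.standard_normal((P, Q, R))
A = rng.standard_normal((I, P))
B = rng.standard_normal((J, Q))
C = rng.standard_normal((K, R))

W = tucker_to_tensor(G, A, B, C)

# One matricized form per mode.
W1 = A @ unfold(G, 0) @ np.kron(C, B).T   # shape (3, 20)
W2 = B @ unfold(G, 1) @ np.kron(C, A).T   # shape (4, 15)
W3 = C @ unfold(G, 2) @ np.kron(B, A).T   # shape (5, 12)
```

Other unfolding conventions (for example row-major ordering) are equally valid, but the order of the Kronecker factors then changes accordingly.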

Materials and Methods
In this section, we propose a strategy for reducing the number of parameters in a gated RBM. First, we refactor the gated RBM model by applying a multimodal tensor-based Tucker decomposition to its three-way weight tensor. Then, we show that by using Tucker decomposition, we can use fewer than the cubically many parameters implied by the model. Finally, we introduce a Contrastive Divergence-based training procedure for the Tucker-decomposed gated RBM, which efficiently parameterizes its bilinear interactions.

Decomposing the Three-Way Tensor in a Gated RBM
The central idea of this research is to represent the required three-way interaction tensor in the gated RBM model using far fewer parameters through its Tucker decomposition. The energy function in a gated RBM (Equation (5)) captures all possible correlations among the components of the x (input), y (output), and h (hidden) layers. In this function, the parameter $\mathcal{W}$ defines a three-way interaction tensor that learns the importance of correlations between layers x and y. However, despite its appealing modeling power, a fully parametrized gated RBM suffers from an explosion in the number of parameters, quickly becoming intractable because the size of the full tensor $\mathcal{W}$ is prohibitive using common dimensions for textual, visual, or output spaces.
As we will see, it is possible to use far fewer parameters by factorizing the multiway interaction tensor via Tucker decomposition. We can plug the Tucker decomposition, Equation (16), into the energy function of the gated RBM, Equation (5). Then, the energy of a joint configuration of the visible (input/output) and hidden units is defined as:

$$E(y, h; x) = -\sum_{ijk} \Big( \sum_{pqr} g_{pqr}\, a_{ip}\, b_{jq}\, c_{kr} \Big) x_i y_j h_k \tag{19}$$

Using the distributive law, this can be rewritten as:

$$E(y, h; x) = -\sum_{pqr} g_{pqr} \Big( \sum_{i} a_{ip} x_i \Big) \Big( \sum_{j} b_{jq} y_j \Big) \Big( \sum_{k} c_{kr} h_k \Big) \tag{20}$$

We can drop subindices for clarity and get:

$$E(y, h; x) = -\,\mathcal{G} \times_1 (x^{\top} A) \times_2 (y^{\top} B) \times_3 (h^{\top} C) \tag{21}$$

It is possible to simplify the notation in (21) if we define the projections

$$\hat{x} = x^{\top} A, \qquad \hat{y} = y^{\top} B, \qquad \hat{h} = h^{\top} C$$

Then the energy function in Equation (21) is given by:

$$E(y, h; x) = -\,\mathcal{G} \times_1 \hat{x} \times_2 \hat{y} \times_3 \hat{h} \tag{22}$$
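The equivalence between the energy computed with the full tensor and the energy computed from the Tucker factors can be checked numerically. In this sketch (toy sizes and random parameters are assumptions), the refactored form never builds the full three-way tensor: it projects x, y, and h with A, B, and C and contracts the projections with the core tensor.

```python
import numpy as np

# Hypothetical toy sizes: layers of 6, 7 and 5 units, core tensor of 3 x 4 x 2.
rng = np.random.default_rng(0)
n_i, n_j, n_k, n_p, n_q, n_r = 6, 7, 5, 3, 4, 2
G = rng.standard_normal((n_p, n_q, n_r))
A = rng.standard_normal((n_i, n_p))
B = rng.standard_normal((n_j, n_q))
C = rng.standard_normal((n_k, n_r))
x = (rng.random(n_i) < 0.5).astype(float)
y = (rng.random(n_j) < 0.5).astype(float)
h = (rng.random(n_k) < 0.5).astype(float)

# Energy with the explicit three-way tensor (bias terms omitted).
W = np.einsum("pqr,ip,jq,kr->ijk", G, A, B, C)
E_full = -np.einsum("ijk,i,j,k->", W, x, y, h)

# Energy with the Tucker factors: project each layer, then contract with the core.
x_hat, y_hat, h_hat = A.T @ x, B.T @ y, C.T @ h
E_tucker = -np.einsum("pqr,p,q,r->", G, x_hat, y_hat, h_hat)
```

The second computation touches only the core tensor and the three factor matrices, which is the source of the savings in memory and computation.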

Interpretation of the Refactored Model: Dimensionality Reduction
Let us consider the three-way tensor with shape $(n_i, n_j, n_k)$ and its corresponding Tucker decomposition presented in Figure 6. As we parametrize the weights of the three-way tensor $\mathcal{W}$ with its Tucker decomposition, we are now able to separate $\mathcal{W}$ into four components, each having a specific role in the gated RBM model. Matrices A and B project the input (x) and output (y) images into spaces of respective dimension $n_p$ and $n_q$. The core tensor $\mathcal{G}$, whose shape is $(n_p, n_q, n_r)$, is used to model the interactions between the input and output image projections. Finally, the matrix C projects the scores of the pair embedding h into a space of dimension $n_r$.
Moreover, if $\mathcal{G}$ has the same shape as $\mathcal{W}$, the Tucker decomposition is simply a change of basis. However, in our case, we are interested in using a change of basis to compress $\mathcal{W}$. If $n_p$, $n_q$, $n_r$ are smaller than $n_i$, $n_j$, $n_k$, the core tensor $\mathcal{G}$ can be thought of as a compressed version of $\mathcal{W}$. Note that the dimensions of the factor matrices A, B, and C follow from the n-mode products that link the core tensor $\mathcal{G}$ to the original tensor $\mathcal{W}$. The factor matrix A has dimensions $n_i \times n_p$ as a result of the i-mode product. Respectively, factor matrix B has dimensions $n_j \times n_q$ (from the j-mode product) and factor matrix C has dimensions $n_k \times n_r$ (from the k-mode product). By constraining $n_p$, $n_q$, $n_r$ to be smaller than $n_i$, $n_j$, $n_k$, we use a lower number of components for each of the three modes while at the same time linking these components to each other by means of the three-way core tensor. Again, consider the three-way tensor in Figure 6, whose respective cardinality of each layer is given by $n_i$, $n_j$ and $n_k$. If we consider $n_i \approx n_j \approx 2048$ and $n_k \approx 2000$, then the number of free parameters in the tensor $\mathcal{W}$ is $\sim 8.39 \times 10^{9}$. It is easy to see that having such a number of free parameters is a problem both for memory and computing costs. In contrast, if we apply Tucker decomposition to this three-way tensor using a core tensor with shape $1024 \times 1024 \times 1000$, then the number of free parameters would be $\sim 1.05 \times 10^{9}$, which is given by the sum of parameters from the core tensor and the three factor matrices. By applying Tucker decomposition we reduce the dimensionality of the model. Note that the compression is determined by the ranks of the core tensor.
Dimensionality reduction has long been an important technique for data representation. It reduces the space complexity of the underlying model so that it has higher stability when fitting, requires fewer parameters, and consequently becomes easier to interpret.
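The parameter counts quoted in this example follow from simple arithmetic, reproduced below in plain Python. The helper names `full_params` and `tucker_params` are introduced here for illustration; bias terms are not counted.

```python
def full_params(n_i, n_j, n_k):
    """Number of weights in the unfactored three-way tensor."""
    return n_i * n_j * n_k

def tucker_params(n_i, n_j, n_k, n_p, n_q, n_r):
    """Core tensor plus the three factor matrices A, B and C."""
    core = n_p * n_q * n_r
    factors = n_i * n_p + n_j * n_q + n_k * n_r
    return core + factors

n_full = full_params(2048, 2048, 2000)                         # about 8.39e9
n_tucker = tucker_params(2048, 2048, 2000, 1024, 1024, 1000)   # about 1.05e9
ratio = n_full / n_tucker                                      # roughly 8x fewer weights

print(f"full: {n_full:.3e}  tucker: {n_tucker:.3e}  ratio: {ratio:.2f}")
```

Almost all remaining parameters sit in the core tensor, so the ranks of the core tensor are the main lever for further compression.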

Training the Refactored Gated RBM
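To train the refactored gated RBM, we can maximize its average conditional log-likelihood for a set of training pairs $\{(x^{\alpha}, y^{\alpha})\}$. By substituting Equations (7) and (9), the derivative of the negative log probability with regard to any element $\theta$ of the parameter tensor is given by

$$-\frac{\partial \log p(y \mid x)}{\partial \theta} = \left\langle \frac{\partial E(y, h; x)}{\partial \theta} \right\rangle_{h} - \left\langle \frac{\partial E(y, h; x)}{\partial \theta} \right\rangle_{h, y} \tag{23}$$

where $\langle \cdot \rangle_{z}$ denotes the average with regard to variable z. By substituting the reparametrized energy function from (21), $E(y, h; x) = -\mathcal{G} \times_1 (x^{\top} A) \times_2 (y^{\top} B) \times_3 (h^{\top} C)$, into (23), we get

$$-\frac{\partial \log p(y \mid x)}{\partial \theta} = -\left\langle \frac{\partial}{\partial \theta} \Big[ \mathcal{G} \times_1 (x^{\top} A) \times_2 (y^{\top} B) \times_3 (h^{\top} C) \Big] \right\rangle_{h} + \left\langle \frac{\partial}{\partial \theta} \Big[ \mathcal{G} \times_1 (x^{\top} A) \times_2 (y^{\top} B) \times_3 (h^{\top} C) \Big] \right\rangle_{h, y} \tag{24}$$

Equation (24) gives the derivative of the (negative) log probability with respect to any parameter in the refactored weight tensor: the core tensor $\mathcal{G}$ and the three factor matrices A, B, and C.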
Similar to the unfactored gated RBM model presented in Section 3, note that the derivative of the (negative) log probability with respect to any parameter of the Tucker refactored gated RBM is given by the difference of two expectations. The first expectation in (24) is over the posterior distribution over the hidden units. On the other hand, the second expectation in (24) is over all possible output/hidden instantiations and is intractable.
Note that the first term in (24) amounts to inferring the transformation (encoding) h from a given pair of observed images x and y, as considered in Equation (12). Plugging the Tucker refactored energy function from (22) into (12) (bias term dropped for clarity) gives:

$$\tilde{h} = \sigma\Big( \mathcal{G} \times_1 (x^{\top} A) \times_2 (y^{\top} B) \times_3 C \Big) \tag{25}$$

Here, multiplying a mode by a row vector such as $x^{\top} A \in \mathbb{R}^{1 \times n_p}$ collapses that mode to size one, so the result is a vector indexed by the remaining mode. In an analogous way, we may consider the task of computing $p(\tilde{y} \mid h, x)$ from an input image x and a given fixed transformation h, as considered in Equation (13). By plugging the Tucker decomposed energy function from (22) into (13) (bias term dropped for clarity), when input and hidden units are given, we get

$$\tilde{y} = \sigma\Big( \mathcal{G} \times_1 (x^{\top} A) \times_2 B \times_3 (h^{\top} C) \Big) \tag{26}$$

Let us now focus again on Equation (24). Note that the second term, also known as the model expectation, is an expectation over all possible instances of the output/mapping units and is intractable. However, similar to the bipartite structure in an RBM, the tripartite structure of a gated RBM facilitates Gibbs sampling. With this in mind, we also consider the task of computing $p(\tilde{x} \mid y, h)$:

$$\tilde{x} = \sigma\Big( \mathcal{G} \times_1 A \times_2 (y^{\top} B) \times_3 (h^{\top} C) \Big) \tag{27}$$

Gibbs sampling then suggests itself as a way to approximate the intractable term in (24). Because of the conditional independences of h given x and y, y given x and h, and x given y and h, we can easily sample from the conditional distributions $p(\tilde{h} \mid x, y)$, $p(\tilde{y} \mid h, x)$, and $p(\tilde{x} \mid y, h)$ using (25)-(27), respectively.
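The three conditionals of the refactored model can be sketched with `np.einsum` as follows. This is an illustrative example with assumed toy sizes and function names, with biases dropped as in the text; each function projects the two given layers through their factor matrices, contracts them with the core tensor, and maps the result back through the third factor matrix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_given_xy(x, y, G, A, B, C):
    """h~ = sigma(G x_1 (x^T A) x_2 (y^T B) x_3 C)."""
    return sigmoid(C @ np.einsum("pqr,p,q->r", G, A.T @ x, B.T @ y))

def output_given_xh(x, h, G, A, B, C):
    """y~ = sigma(G x_1 (x^T A) x_2 B x_3 (h^T C))."""
    return sigmoid(B @ np.einsum("pqr,p,r->q", G, A.T @ x, C.T @ h))

def input_given_yh(y, h, G, A, B, C):
    """x~ = sigma(G x_1 A x_2 (y^T B) x_3 (h^T C))."""
    return sigmoid(A @ np.einsum("pqr,q,r->p", G, B.T @ y, C.T @ h))

# Hypothetical toy model.
rng = np.random.default_rng(0)
n_i, n_j, n_k, n_p, n_q, n_r = 6, 6, 4, 3, 3, 2
G = rng.standard_normal((n_p, n_q, n_r))
A = rng.standard_normal((n_i, n_p))
B = rng.standard_normal((n_j, n_q))
C = rng.standard_normal((n_k, n_r))
x = (rng.random(n_i) < 0.5).astype(float)
y = (rng.random(n_j) < 0.5).astype(float)

p_h = hidden_given_xy(x, y, G, A, B, C)
h = (rng.random(n_k) < p_h).astype(float)
p_y = output_given_xh(x, h, G, A, B, C)
p_x = input_given_yh(y, h, G, A, B, C)
```

All three functions share the same parameters, so a single set of Tucker factors supports inference in every direction of the tripartite model.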
Given the tripartite structure of the gated RBM, it is possible to perform three-way alternating Gibbs sampling. This scheme of optimizing an undirected graphical model is known as Contrastive Divergence. In this research, we perform a single Gibbs iteration when approximating the negative phase.
Using this Contrastive Divergence approach with Equations (25)- (27), we can make use of a machine learning library that supports reverse-mode automatic differentiation such as PyTorch.
PyTorch provides two high-level features: tensor computing with strong acceleration via GPU and automatic differentiation for all operations on tensors. Conceptually, this method for automatic differentiation builds a graph that records all of the operations that created the data as they are executed. The leaves in the graph are the input tensors, and the roots are the output tensors. PyTorch traces this graph from roots to leaves, automatically computing the gradients using the chain rule. From a computational point of view, training a model consists of two phases: a forward pass to compute the value of the loss function and a backward pass to compute the gradients of the learnable parameters. With this in mind, we use the Contrastive Divergence approach for building the forward pass and generating the graph. This process is summarized in Algorithm 1.

Algorithm 1: Forward pass
Input: $(x_i, y_i)$: training pair; $k$: number of steps for CD learning
Output: $x_k$: input vector once CD-k is applied; $y_k$: output vector once CD-k is applied
1. Calculate negative phase. For each step in $k$:
   (a) Calculate $p(\tilde{y}_k \mid x_0, h_k)$ using Equation (26). Sample the $y_k$ states.
   (b) Calculate $p(\tilde{x}_k \mid y_0, h_k)$ using Equation (27). Sample the $x_k$ states.
   (c) Calculate $p(\tilde{h}_k \mid x_0, y_k)$ using Equation (25). Sample the $h_k$ states.

When computing the forward pass, PyTorch simultaneously performs the requested computations and builds up a graph representing the function that computes the gradient. Once the forward pass is completed, this graph is evaluated in the backward pass to compute the gradients. To build the backward pass, we used the concept of free energy as presented in (9). Under the gated RBM, the probability of observing a configuration of output units y given the input units x can be obtained by marginalizing out the hidden units. This computation defines the free energy and is given by:

$$p(y \mid x) = \sum_{h} p(y, h \mid x) = \frac{1}{Z(x)} \sum_{h} \exp\big(-E(y, h; x)\big) \tag{28}$$

Generally speaking, RBMs and gated RBMs belong to the more general class of energy-based models (EBMs) [24]. EBMs have probabilistic equations of the following form:

$$p(x) = \frac{\exp\big(-G(x)\big)}{Z} \tag{29}$$

where x denotes the observed variables, Z is the normalization term, and G is the energy function. Equation (28) has the same form as the equation describing an EBM (29) once we write $\sum_{h} \exp(-E(y, h; x)) = \exp(-F(x, y))$, where $F(x, y)$ is the free energy; we can therefore minimize the free energy function for maximum (log-)likelihood. Unfortunately, $p(y \mid x)$ is intractable to compute since $Z(x)$ involves a sum over all possible settings of the output and hidden units:

$$Z(x) = \sum_{y, h} \exp\big(-E(y, h; x)\big)$$

However, $\log p(y \mid x)$ can be computed up to a constant, which is useful for scoring observations under a fixed model. First, we notice that Equation (28) involves a sum over all possible configurations of the hidden units, $\sum_{h}$, but we observe that the hidden units are binary. This means that we only need to consider two possible states for each unit.
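This observation leads to a tractable form of the free energy. Writing $W_{::k}\,xy = \sum_{ij} W_{ijk}\, x_i y_j$ for the total input to hidden unit $k$ (bias terms omitted) and $K$ for the number of hidden units, the sum over $h \in \{0, 1\}^{K}$ factorizes over the hidden units:

$$\begin{aligned}
p(y \mid x) &= \frac{1}{Z} \sum_{h_1 \in \{0,1\}} \exp\big(h_1 W_{::1}\,xy\big) \cdots \sum_{h_K \in \{0,1\}} \exp\big(h_K W_{::K}\,xy\big) \\
&= \frac{1}{Z} \big(1 + \exp(W_{::1}\,xy)\big) \cdots \big(1 + \exp(W_{::K}\,xy)\big) \\
&= \frac{1}{Z} \exp\big(\log(1 + \exp(W_{::1}\,xy))\big) \cdots \exp\big(\log(1 + \exp(W_{::K}\,xy))\big) \\
&= \frac{1}{Z} \exp\Big( \sum_{k=1}^{K} \log\big(1 + \exp(W_{::k}\,xy)\big) \Big)
\end{aligned} \tag{30}$$

For scoring observations under a model, we can ignore the partition function Z, which leaves the free energy $F(x, y) = -\sum_{k} \log\big(1 + \exp(W_{::k}\,xy)\big)$. Finally, the backward pass can be computed using the free energy function in its scoring version (no partition function Z), as shown in Algorithm 2.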
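The factorization of the sum over binary hidden configurations can be confirmed numerically on a small model. The following sketch (toy sizes and function names are assumptions; biases are omitted as in the derivation) compares a brute-force sum over all $2^K$ hidden states with the closed form based on $\log(1 + \exp(\cdot))$.

```python
import itertools
import numpy as np

def free_energy(x, y, W):
    """Closed form: F(x, y) = -sum_k log(1 + exp(sum_ij W_ijk x_i y_j))."""
    s = np.einsum("ijk,i,j->k", W, x, y)
    return -np.sum(np.logaddexp(0.0, s))       # logaddexp(0, s) = log(1 + exp(s))

def free_energy_bruteforce(x, y, W):
    """Same quantity by explicit enumeration of all binary hidden vectors."""
    s = np.einsum("ijk,i,j->k", W, x, y)
    n_hidden = W.shape[2]
    total = sum(np.exp(np.dot(h, s))
                for h in itertools.product([0, 1], repeat=n_hidden))
    return -np.log(total)

# Hypothetical toy model with 4 hidden units (16 hidden configurations).
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 5, 4))
x = (rng.random(5) < 0.5).astype(float)
y = (rng.random(5) < 0.5).astype(float)

F_closed = free_energy(x, y, W)
F_brute = free_energy_bruteforce(x, y, W)
```

The closed form costs one contraction and K scalar operations, whereas the enumeration grows exponentially with the number of hidden units.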

Algorithm 2: Backward pass
Input: $x_0$: input vector at step 0; $y_0$: output vector at step 0; $x_k$: input vector once CD-k is applied; $y_k$: output vector once CD-k is applied
1. Calculate the free energy $F(x_0, y_0)$ using Equation (30).
2. Calculate the free energy $F(x_k, y_k)$ using Equation (30).
3. Calculate the difference between the free energies $F(x_0, y_0)$ and $F(x_k, y_k)$.
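Algorithms 1 and 2 can be combined into a short NumPy sketch that produces the CD samples and the free-energy difference used as the training loss. This is an illustration under explicit assumptions, not the implementation used in the experiments: the chain is assumed to start from a hidden sample drawn from $p(h \mid x_0, y_0)$, biases are omitted, the function names are hypothetical, and NumPy is used instead of PyTorch, so no gradients are computed here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def free_energy(x, y, G, A, B, C):
    """Scoring free energy with the Tucker-factored weights (no partition function)."""
    s = C @ np.einsum("pqr,p,q->r", G, A.T @ x, B.T @ y)
    return -np.sum(np.logaddexp(0.0, s))

def cd_forward(x0, y0, G, A, B, C, k, rng):
    """Forward pass: k Gibbs steps following steps (a)-(c) of Algorithm 1."""
    # ASSUMPTION: the chain is initialised with h sampled from p(h | x0, y0).
    h = bernoulli(sigmoid(C @ np.einsum("pqr,p,q->r", G, A.T @ x0, B.T @ y0)), rng)
    xk, yk = x0, y0
    for _ in range(k):
        yk = bernoulli(sigmoid(B @ np.einsum("pqr,p,r->q", G, A.T @ x0, C.T @ h)), rng)
        xk = bernoulli(sigmoid(A @ np.einsum("pqr,q,r->p", G, B.T @ y0, C.T @ h)), rng)
        h = bernoulli(sigmoid(C @ np.einsum("pqr,p,q->r", G, A.T @ x0, B.T @ yk)), rng)
    return xk, yk

def cd_loss(x0, y0, xk, yk, G, A, B, C):
    """Backward-pass objective: F(x0, y0) - F(xk, yk)."""
    return free_energy(x0, y0, G, A, B, C) - free_energy(xk, yk, G, A, B, C)

# Hypothetical toy model and training pair.
rng = np.random.default_rng(0)
n, n_k, r = 9, 6, 4
G = 0.1 * rng.standard_normal((r, r, r))
A, B = rng.standard_normal((n, r)), rng.standard_normal((n, r))
C = rng.standard_normal((n_k, r))
x0 = (rng.random(n) < 0.1).astype(float)
y0 = (rng.random(n) < 0.1).astype(float)

xk, yk = cd_forward(x0, y0, G, A, B, C, k=1, rng=rng)
loss = cd_loss(x0, y0, xk, yk, G, A, B, C)
```

In an automatic differentiation framework, the same loss would be written with differentiable tensor operations so that the gradients with respect to the core tensor and the factor matrices are obtained in the backward pass.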

Results and Discussion
To illustrate the performance and viability of the model, we conducted experiments on pairs of shifted random binary images using the dataset provided on the accompanying website of Memisevic and Hinton [11].
We trained the model on pairs of transformed image patches, forcing the hidden variables to encode the transformations. The goal of this experiment is to investigate what forms the model weights take on when trained on affine transformations. The dataset consists of 10,000 binary image patch pairs of size 13 × 13 pixels each, where the output image in each pair is a transformed version of the input image. The input images are shifted by one pixel in a random direction in each sample. As stated in Memisevic and Hinton [11] (p. 1482), there is no structure at all in the images themselves, which are composed of random pixels. The only source of structure in the data comes from the way the images transform. Figure 7 shows three different samples of the binary images in the upper row. The lower row shows the shifted binary samples. This dataset was generated by a set of initial images where each pixel in the image is turned on randomly with probability 0.1. These initial images are used as input for the input layer x in the gated RBM model. Then, a random direction is chosen from the set {up, down, left, right, up-left, up-right, down-left, down-right, no shift} and each initial image is shifted by one pixel to create the output images. The newly appearing edges are filled randomly and independently as before with probability 0.1. The shifted images are used as input for the output layer y in the gated RBM model. For this task, we trained a gated RBM with the proposed training algorithm from Section 5.3. We performed parameter exploration via grid search considering the following parameters: core tensor ranks, number of units in the hidden layer, and learning rate. We ranged the value of hidden units from 64 to 144 units. We did not identify significant changes in the filters learned given the number of units in the hidden layer or the core tensor ranks. The reason is that much of the interactions between the input and output layers are captured in the core tensor. 
The learning rate ranged from 1 × 10 −2 to 1 × 10 −4 , and the core tensor ranks evaluated were 80 × 80 × 80, 120 × 120 × 120 and 169 × 169 × 169.
In Figure 8, we display the filters learned at different stages by the input, output, and hidden layers as a qualitative assessment of the trained gated RBM model. The filters displayed in Figure 8 correspond to a model with a core tensor with ranks 120 × 120 × 120, 144 units in the hidden layer, and a learning rate of 1 × 10 −3 . In the figure, each column in the factor matrices is rearranged to be displayed as a square. For example, if factor matrix A is of shape 120 × 169, then 120 squares of shape 13 × 13 are displayed. From iteration 0 in Figure 8, we observe that the weights for each layer are randomly initialized. On each iteration, the weights become more structured with no supervision in the model. Note that the filters presented in iteration 211 in Figure 8 do not totally resemble the filters found in [11] although the same dataset of random transformations is used. The reason is that the factorization of the three-way tensor proposed in this research is different to the one presented in [11]. While Memisevic and Hinton [11] project each mode of the three-way tensor in the gated RBM onto smaller layers, the current factorization involves a core tensor that models the interactions between the input and output image projections modulated via the gated connections in the hidden layer. In fact, the core tensor also learns filters, as presented in Figure 9. Moreover, the filters learned by each layer of the model show a correspondence with the filters in the three-way core tensor. In other words, each unit in the model is connected to the core tensor with variable strengths, and this allows them to detect specific changes in frequency, orientation, and phase shift in the data. In Figure 9, we show the filters learned by a model with a core tensor of shape 49 × 49 × 49. In the image, we display each frontal slice of the core tensor. 
For simplicity, we present the frontal slices for this model with a smaller tensor, but the same behavior was observed regardless of the tensor shape.
We also confirmed that the learning rate affects convergence. The learning rate controls how much the parameters in the model are adjusted with respect to the energy loss gradient. Figure 10 shows the energy loss during training for different learning rate configurations: 1 × 10⁻², 1 × 10⁻³, and 1 × 10⁻⁴. We limit the number of epochs to 10. The number of units in the hidden layer is 144, and the ranks selected for the core tensor are 120 × 120 × 120. As can be observed in the figure, the largest learning rate provides the fastest convergence, in less than one epoch. With the smallest learning rate, on the other hand, the energy loss decreases steadily on each epoch but does not reach convergence. In each case, we observed that the corresponding model learns filters at different stages. In general, a smaller learning rate requires more training epochs given the smaller changes made to the model parameters in each update. However, if the learning rate is too small, the learning process can get stuck and fail to converge. A larger learning rate results in rapid changes and requires fewer training epochs, but too large a learning rate may cause the model to converge to a sub-optimal solution. The rate at which the learning takes place is an important hyperparameter and will vary depending on the application of the model.
To gain insight into the way the core tensor affects the learning speed, we also trained the gated RBM model on the random-shifts dataset under various configurations of the core tensor while holding the learning rate and the number of hidden units constant. The gated RBM evaluated has 169 units in each of its input and output layers and 169 units in its hidden layer. With this configuration, the original weight tensor has a shape of 169 × 169 × 169.
On the other hand, the core tensor was tested with ranks 80 × 80 × 80, 120 × 120 × 120, and 169 × 169 × 169. Note that in the last configuration, the core tensor provides no compression since it has the same shape as the original tensor. For simplicity, we only considered cubical tensors. The models were trained using the same learning rate of 1 × 10⁻³ for 10 epochs and were not stopped via early stopping. The idea is to isolate the effect of the core tensor configurations. Figure 11 shows the energy loss when training under the three core tensor configurations. Note that in each case, the energy loss decreases steadily, although with different slopes. We should remember that in a gated RBM, as in an RBM, the energy function plays a role analogous to that of the cost function. In fact, the configuration with the smallest core tensor has the highest energy decrease, while the configuration whose core tensor has the exact shape of the original tensor has the lowest energy decrease. The reason is that, in the latter case, the core tensor does not provide any compression but only functions as a change of basis. In addition, note that different configurations of the core tensor explore different configurations of the energy loss, which is explained by the fact that the core tensor modulates the level of interaction between the different components of each layer, resulting in different configurations of energy.
In Figure 12, we show the training time (in seconds) for the same core tensor configurations. As expected, this figure confirms that the training speed is a function of the core tensor dimensionality. Although it could be inferred from Figures 11 and 12 that the best model configuration is the one with the smallest core tensor, the core tensor ranks should be calibrated according to the specifics of the problem to be solved and evaluated according to the desired performance of the model.
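To make the link between core tensor size and model size concrete, the following minimal sketch counts the weight parameters for the three cubical configurations above. It assumes the standard Tucker layout (one core tensor plus one factor matrix per mode) and ignores bias terms; the function names are illustrative and are not taken from the released code.

```python
def full_param_count(n_in, n_out, n_hid):
    """Number of weights in an unfactored three-way tensor."""
    return n_in * n_out * n_hid


def tucker_param_count(n_in, n_out, n_hid, ranks):
    """Weights in a Tucker model: core tensor plus three factor matrices.

    Assumption: biases are not counted.
    """
    r_in, r_out, r_hid = ranks
    core = r_in * r_out * r_hid
    factors = n_in * r_in + n_out * r_out + n_hid * r_hid
    return core + factors


# Layer sizes used in the core tensor experiment: 169 units per layer.
full = full_param_count(169, 169, 169)
counts = {r: tucker_param_count(169, 169, 169, (r, r, r)) for r in (80, 120, 169)}

for r, count in counts.items():
    print(f"core {r} x {r} x {r}: {count:,} weights (full tensor: {full:,})")
```

With a 169 × 169 × 169 core, the count exceeds that of the full tensor because the factor matrices are added on top of a core that is already as large as the original, which is consistent with the remark that this configuration provides no compression.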
As a last evaluation, we compared the Tucker refactored gated RBM proposed in this research against the unfactored model presented in [10], which uses a full three-way tensor. The filters learned by the input, output, and hidden layers in combination with the filters learned by the core tensor are highly structured and are in fact very similar to the ones that are applied to the output images in [10], as shown in Figure 13b. The latter produces a canvas-like effect on the image. In both cases, the filters are learned in an unsupervised fashion. We should underline that no code accompanies the research in [10]; we implemented the unfactored model in PyTorch using the update rules provided in that paper.
Although the filters produced by the Tucker refactored and unfactored models might be similar, training behaves differently under the two approaches. To better understand this, we trained both models on the same random-shifts dataset while holding all other variables constant. The input and output images have a size of 13 × 13 pixels each, which sets the number of neurons in the input and output layers to 169 in both models. The number of hidden units and the learning rate are kept the same in both implementations, with values of 121 and 1 × 10⁻³, respectively. Figure 14 shows the energy loss for the Tucker refactored model, while Figure 15 shows the energy loss for the unfactored model. The only difference between the two models is that the unfactored gated RBM uses a full tensor of shape 169 × 121 × 169, whereas the Tucker refactored model factorizes the full tensor into a core tensor of shape 80 × 80 × 80, input and output factor matrices of shape 169 × 80, and a hidden factor matrix of shape 121 × 80. Note that the unfactored gated RBM starts with a much higher energy configuration. The Tucker refactored model, on the other hand, starts at a much lower energy configuration as a result of the core tensor modulating the interactions of the three layers. The relationship between the core tensor configuration and the resulting energy configuration explored is also observable in Figure 14. From Figures 14 and 15, we see that the Tucker refactored gated RBM reaches convergence in fewer iterations than the unfactored gated RBM with the same learning rate and number of hidden units. Finally, in Figure 16, we show a comparison of the training time (in seconds) for the Tucker refactored and unfactored gated RBMs. The unfactored model using a full tensor takes much longer to train than the Tucker refactored model proposed in this research. This is because the core tensor reduces the number of free parameters in the model while maintaining its learning capacity.
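As a rough check on the reduction in free parameters, the weights of the two models can be counted directly from the shapes stated above. The count ignores bias terms, which is an assumption made here because the text lists only the tensor and factor matrix shapes.

```latex
\begin{align*}
N_{\text{full}} &= 169 \times 121 \times 169 = 3{,}455{,}881,\\
N_{\text{Tucker}} &= 80^3 + 2\,(169 \times 80) + 121 \times 80
                  = 512{,}000 + 27{,}040 + 9{,}680 = 548{,}720,\\
\frac{N_{\text{full}}}{N_{\text{Tucker}}} &\approx 6.3 .
\end{align*}
```

Under this count, the Tucker refactored model holds roughly one sixth of the weights of the unfactored model.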

Conclusions
The multimodal tensor-based Tucker decomposition presented in this research has the useful property that it keeps the independent structure in the gated RBM model intact. We take advantage of this independent structure by developing a contrastive-divergence-based training procedure used for inference and learning.
In this paper, we combined Tucker decomposition with a gated RBM and showed how the model allows us to obtain image filter pairs that are highly structured when trained on transformations of images. There is some resemblance between our approach and the bilinear model for the Visual Question Answering (VQA) task proposed in [25]; however, the problems solved and the learning methods presented are quite different. Despite the appealing modeling power of gated RBMs, the literature on them is still scarce. Our literature review revealed that almost all publications on the subject present only different applications of the model, namely texture modeling [26], classification [27], and rotation representations [28]. In that sense, this research contributes an alternative method for training the gated RBM.
As we have seen, Tucker decomposition is an effective tool for reducing the dimensionality of multidimensional data. Implementing it in the gated RBM adds a hyperparameter to the model: the multimodal shape of the core tensor. The resulting model allows us to explicitly control the model complexity and to choose an accurate and interpretable factorization of the learnable parameters. One important property of tensor decomposition is that the number of parameters in the core tensor and factor matrices is usually much smaller than the number of parameters in the original tensor. Using the compression ratio (the number of elements before decomposition relative to the number after) and the approximation error caused by tensor decomposition, we can evaluate which decomposition is appropriate.
Time complexity for inference or one-step inference in the Tucker decomposed gated RBM is O(n_p n_q n_r), where n_p, n_q, and n_r correspond to the dimensionality of each mode in the core tensor and directly impact the modeling complexity that will be allowed for each modality. Note that the core tensor can be fixed to a small rank regardless of the dimensionality of each layer in the weight tensor. This is in contrast to the time complexity for inference in a fully parameterized gated RBM, which is O(n_i n_j n_k). The latter is in terms of the dimensionality of the input, output, and hidden layers in the fully parametrized gated RBM and is not efficient in terms of memory with large inputs.
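The following sketch illustrates where the n_p n_q n_r term comes from. It assumes the standard Tucker form W_ijk = Σ_pqr G_pqr A_ip B_jq C_kr and the usual gated RBM conditional p(h_k = 1 | x, y) = σ(Σ_ij W_ijk x_i y_j), with bias terms omitted. All names (A, B, C, G, hidden_probs_tucker, and so on) are illustrative and are not taken from the released code; plain Python lists are used so that the example runs without extra dependencies.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def project(factor, v):
    """Return factor^T v for a factor matrix stored as a list of rows."""
    rank = len(factor[0])
    return [sum(factor[i][p] * v[i] for i in range(len(v))) for p in range(rank)]


def hidden_probs_tucker(x, y, A, B, C, G):
    """p(h_k = 1 | x, y) computed from the factors (biases omitted)."""
    fx = project(A, x)  # length n_p
    fy = project(B, y)  # length n_q
    n_r = len(G[0][0])
    # Core contraction: n_p * n_q * n_r multiply-adds.
    core_out = [
        sum(G[p][q][r] * fx[p] * fy[q]
            for p in range(len(fx)) for q in range(len(fy)))
        for r in range(n_r)
    ]
    # Map from the hidden-mode rank back to the n_k hidden units.
    return [sigmoid(sum(C[k][r] * core_out[r] for r in range(n_r)))
            for k in range(len(C))]


def hidden_probs_full(x, y, W):
    """Same quantity from an explicit n_i x n_j x n_k weight tensor."""
    n_k = len(W[0][0])
    return [
        sigmoid(sum(W[i][j][k] * x[i] * y[j]
                    for i in range(len(x)) for j in range(len(y))))
        for k in range(n_k)
    ]


def reconstruct(A, B, C, G):
    """Build W_ijk = sum_pqr G_pqr A_ip B_jq C_kr (used only for checking)."""
    n_p, n_q, n_r = len(G), len(G[0]), len(G[0][0])
    return [[[sum(G[p][q][r] * A[i][p] * B[j][q] * C[k][r]
                  for p in range(n_p) for q in range(n_q) for r in range(n_r))
              for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_i, n_j, n_k = 5, 5, 4   # toy layer sizes
n_p, n_q, n_r = 3, 3, 2   # toy core ranks
A = rand_matrix(n_i, n_p, rng)
B = rand_matrix(n_j, n_q, rng)
C = rand_matrix(n_k, n_r, rng)
G = [rand_matrix(n_q, n_r, rng) for _ in range(n_p)]
x = [rng.randint(0, 1) for _ in range(n_i)]
y = [rng.randint(0, 1) for _ in range(n_j)]

p_tucker = hidden_probs_tucker(x, y, A, B, C, G)
p_full = hidden_probs_full(x, y, reconstruct(A, B, C, G))
max_diff = max(abs(a - b) for a, b in zip(p_tucker, p_full))
print("max difference between the two computations:", max_diff)
```

In this sketch the two computations should agree up to floating-point rounding. Besides the core contraction, the factored version also performs the projections of x, y, and the hidden-mode output through the factor matrices, but it never forms the n_i × n_j × n_k tensor.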
Strictly speaking, we could also select the dimension of each modality n_p, n_q, n_r to be equal to or greater than n_i, n_j, n_k. This would lead to a change of basis or an explosion in the number of free parameters. There are several interesting directions for future work. In this research, we used a fixed shape for the core tensor; however, fine-tuning this new hyperparameter needs to be addressed in future research. A possible idea is to measure the approximation error and select the smallest multimode dimensionality that meets a selected threshold. Another idea is to use the tripartite structure of the gated RBM to model discriminative tasks in which an image on layer x and its corresponding target in layer y are provided.
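The threshold-based selection suggested above could be sketched as follows. This is only an illustration of the selection logic: the function select_ranks is hypothetical, and toy_error is a synthetic placeholder rather than a measured approximation error. In practice, error_fn would fit or evaluate the decomposition for the given ranks and return its error.

```python
def core_size(ranks):
    """Number of elements in a core tensor with the given ranks."""
    n_p, n_q, n_r = ranks
    return n_p * n_q * n_r


def select_ranks(candidates, error_fn, threshold):
    """Return the smallest candidate core shape whose error meets the threshold.

    Returns None when no candidate is within the threshold.
    """
    feasible = [ranks for ranks in candidates if error_fn(ranks) <= threshold]
    if not feasible:
        return None
    return min(feasible, key=core_size)


def toy_error(ranks):
    """Synthetic stand-in for an approximation error (NOT a measurement)."""
    return 1.0 / min(ranks)


candidates = [(49, 49, 49), (80, 80, 80), (120, 120, 120), (169, 169, 169)]
chosen = select_ranks(candidates, toy_error, threshold=0.01)
none_found = select_ranks(candidates, toy_error, threshold=0.001)
print("selected core shape:", chosen)
```

Ranking candidates by core size is one possible choice; the total parameter count, including the factor matrices, would be an equally reasonable criterion.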

Data Availability Statement:
The code that supports the findings of this research is openly available at https://github.com/macma80/tucker_gatedrbm (accessed on 6 August 2021).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

RBM    Restricted Boltzmann machine
CD     Contrastive Divergence
GPU    Graphics processing unit
EBM    Energy-based model
VQA    Visual Question Answering