# Fast Flow Reconstruction via Robust Invertible n × n Convolution


Computer Vision and Image Understanding Lab, Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR 72501, USA

Department of Computer Science and Software Engineering, Concordia University, Montreal, QC H3G 2V4, Canada

Faculty of Information Technology, University of Science, VNU-HCM, Ho Chi Minh 721337, Vietnam

Author to whom correspondence should be addressed.

Academic Editor: Massimo Cafaro

Received: 31 May 2021 / Revised: 29 June 2021 / Accepted: 6 July 2021 / Published: 8 July 2021

(This article belongs to the Collection Machine Learning Approaches for User Identity)

Flow-based generative models have recently become one of the most efficient approaches to modeling data generation. Indeed, they are constructed from a sequence of invertible and tractable transformations. Glow first introduced a simple type of generative flow using an invertible $1\times 1$ convolution. However, the $1\times 1$ convolution suffers from limited flexibility compared to standard convolutions. In this paper, we propose a novel invertible $n\times n$ convolution approach that overcomes the limitations of the invertible $1\times 1$ convolution. In addition, our proposed network is not only tractable and invertible but also uses fewer parameters than standard convolutions. Experiments on the CIFAR-10, ImageNet and Celeb-HQ datasets have shown that our invertible $n\times n$ convolution helps to significantly improve the performance of generative models.

Supervised deep learning models have recently achieved numerous breakthrough results in various applications, for example, Image Classification [1,2,3], Object Detection [4,5,6], Face Recognition [7,8,9,10,11,12,13,14], Image Segmentation [15,16] and Generative Models [17,18,19,20,21,22]. However, these methods usually require a huge amount of annotated data, which is highly expensive to obtain. To sidestep the need for large annotation efforts, generative models have become a feasible solution. The main objective of generative models is to learn the hidden dependencies that exist in realistic data so that they can extract meaningful features and variable interactions to synthesize new realistic samples without human supervision or labeling. Generative models can be used in numerous applications such as anomaly detection [23], image inpainting [24], data generation [20,25], super-resolution [26], face synthesis [22,27,28], and so forth. However, learning generative models is an extremely challenging process due to the high dimensionality of the data.

Two types of generative models have been extensively deployed in recent years: likelihood-based methods [29,30,31,32] and Generative Adversarial Networks (GANs) [33]. Likelihood-based methods have three main categories: autoregressive models [30], variational autoencoders (VAEs) [34], and flow-based models [29,31,32]. A flow-based generative model is constructed from a sequence of invertible and tractable transformations; since the model explicitly learns the data distribution, the loss function is simply the negative log-likelihood.

The flow-based model was first introduced in NICE [31] and later extended in RealNVP [32]. These methods introduce an affine coupling layer that is invertible and has a tractable Jacobian determinant. By the design of the coupling layers, at each stage only a subset of the data is transformed while the rest is kept fixed; they may therefore be limited in flexibility. To overcome this limitation, coupling layers are alternated with less complex transformations that operate on all dimensions of the data. In RealNVP [32], the authors use a fixed permutation based on checkerboard and channel-wise masks. Kingma et al. [29] simplify the architecture by replacing the reverse permutation on the channel ordering with an invertible $1\times 1$ convolution.

However, $1\times 1$ convolutions are not flexible enough in these scenarios, and computing the inverse of a standard $n\times n$ convolution is extremely hard and usually incurs high computational costs. Prior approaches design invertible $n\times n$ convolutions using emerging convolutions [35], periodic convolutions [35], autoregressive flows [36] or stochastic approximations [37,38,39]. In this paper, we propose an approach that generalizes the invertible $1\times 1$ convolution to the more general $n\times n$ form. First, we reformulate the standard convolution layer by shifting the inputs instead of the kernels. Then, we propose an invertible shift function with a tractable Jacobian determinant. Through experiments on the CIFAR-10 [40], ImageNet [41] and Celeb-HQ [42] datasets, we show that our proposal is effective and efficient for high-dimensional data. Figure 1 illustrates the advantages of our approach with high-resolution synthesized images.

- Firstly, by analyzing the standard convolution layer, we reformulate its equation into a form such that, rather than shifting the kernels during the convolution process, shifting the input provides equivalent results.
- Secondly, we propose a novel invertible shift function that mathematically helps to reduce the computational cost of the standard convolution while keeping the range of the receptive fields. The determinant of the Jacobian matrix produced by this shift function can be computed efficiently.
- Thirdly, evaluations on several datasets of both objects and faces have shown the generalization ability of the proposed $n\times n$ convolution built on our novel invertible shift function.

Generative models can be divided into two groups, that is, Generative Adversarial Networks and Flow-based Generative Models. In the first group, Generative Adversarial Networks [33] provide an appropriate solution to model data generation. The discriminative model learns to distinguish real data from fake samples produced by a generative model, and the two models are trained as if playing a mini-max game. Meanwhile, in the second group, Flow-based Generative Models [29,31,32] are constructed from a sequence of **invertible** and **tractable** transformations. Unlike GANs, the model explicitly learns the data distribution $p\left(\mathbf{x}\right)$, and therefore the loss function is efficiently formulated as the log-likelihood.

In this section, we discuss several types of flow-based layers that are commonly used in flow-based generative models. An overview of several invertible functions is provided in Table 1. In particular, all functions admit an easily obtained inverse and a tractable Jacobian determinant. The symbols $\odot, /$ denote element-wise multiplication and division. $h, w$ denote the height and width of the input/output. $c$ is the depth (channel) index and $i, j$ are spatial indices.

Let $\mathbf{x}$ be a high-dimensional vector with unknown true distribution $\mathbf{x}\sim {p}_{\mathcal{X}}\left(\mathbf{x}\right)$, $\mathbf{x}\in \mathcal{X}$. Given a simple prior probability distribution ${p}_{\mathcal{Z}}$ on a latent variable $\mathbf{z}\in \mathcal{Z}$ and a bijection $f:\mathcal{X}\to \mathcal{Z}$, the change-of-variables formula defines a model distribution on $\mathcal{X}$ as shown in Equation (1), where $\frac{\partial f\left(\mathbf{x}\right)}{\partial \mathbf{x}}$ is the Jacobian of $f$ at $\mathbf{x}$. The log-likelihood objective is then equivalent to minimizing Equation (2):

$$p_{\mathcal{X}}(\mathbf{x}) = p_{\mathcal{Z}}(\mathbf{z})\left|\det\left(\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right)\right|,$$

$$\mathcal{L}(\mathcal{X}) = -\sum_{\mathbf{x}\in\mathcal{X}} \log p_{\mathcal{X}}(\mathbf{x}) = -\sum_{\mathbf{x}\in\mathcal{X}}\left[\log p_{\mathcal{Z}}(\mathbf{z}) + \log\left|\det\left(\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right)\right|\right].$$

Since the data $\mathbf{x}$ are discrete, we add random uniform noise $u\sim \mathcal{U}(0,a)$, where $a$ is determined by the discretization level of the data, to make $\mathbf{x}$ continuous. The generative process is defined in Equation (3).
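This dequantization step can be sketched in a few lines (a minimal illustration, assuming 8-bit pixel data so that the discretization gap is $a=1$):

```python
import numpy as np

rng = np.random.default_rng(0)
x_discrete = rng.integers(0, 256, size=(4, 4)).astype(np.float64)

# 8-bit pixels take integer levels, so the discretization gap is a = 1.
a = 1.0
u = rng.uniform(0.0, a, size=x_discrete.shape)
x_continuous = x_discrete + u   # now lies in the support of a continuous density

# Each dequantized value stays inside the unit interval above its integer level.
assert np.all(x_continuous >= x_discrete)
assert np.all(x_continuous < x_discrete + a)
```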

$$\begin{array}{cc}\hfill \mathbf{z}& \sim {p}_{\mathcal{Z}}\left(\mathbf{z}\right)\hfill \\ \hfill \mathbf{x}& ={f}^{-1}\left(\mathbf{z}\right).\hfill \end{array}$$

The bijection function $f$ is constructed from a sequence of invertible transformations with tractable Jacobian determinants: $f={f}_{1}\circ {f}_{2}\circ \dots \circ {f}_{K}$, where $K$ is the number of transformations. Such a sequence of invertible transformations is also called a normalizing flow. Equation (2) can then be written as Equation (4),
where ${\mathbf{h}}_{k}={f}_{1}\circ {f}_{2}\circ \dots \circ {f}_{k}\left({\mathbf{h}}_{0}\right)$ with ${\mathbf{h}}_{0}=\mathbf{x}$.

$$\mathcal{L}(\mathcal{X}) = -\sum_{\mathbf{x}\in\mathcal{X}}\log p_{\mathcal{X}}(\mathbf{x}) = -\sum_{\mathbf{x}\in\mathcal{X}}\left[\log p_{\mathcal{Z}}(\mathbf{z}) + \sum_{k=1}^{K}\log\left|\det\left(\frac{\partial \mathbf{h}_{k}}{\partial \mathbf{h}_{k-1}}\right)\right|\right]$$
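The objective in Equation (4) can be sketched as follows, using two toy elementwise maps as stand-ins for actual flow layers (the functions `f1` and `f2` below are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

# Two toy invertible maps with tractable log|det J| (illustrative stand-ins
# for real flow layers such as coupling layers or 1x1 convolutions).
def f1(x):
    y = 2.0 * x + 1.0
    logdet = x.size * np.log(2.0)   # Jacobian is 2*I
    return y, logdet

def f2(x):
    y = 0.5 * x
    logdet = x.size * np.log(0.5)   # Jacobian is 0.5*I
    return y, logdet

def log_standard_normal(z):
    # log density of a standard normal prior p_Z
    return -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi))

def negative_log_likelihood(x):
    h, total_logdet = x, 0.0
    for f in (f1, f2):              # the composed flow f = f1 o f2 o ... o fK
        h, logdet = f(h)
        total_logdet += logdet      # log-dets of the chain simply add up
    return -(log_standard_normal(h) + total_logdet)

x = np.array([0.3, -1.2, 0.7])
nll = negative_log_likelihood(x)
```

Note how the log-determinants of the individual transformations simply accumulate, which is exactly the inner sum over $k$ in Equation (4).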

In this section, we revisit the standard $n\times n$ convolution. Let $\mathbf{X}$ be a $C\times H\times W$ input and $\mathbf{W}$ a $D\times C\times K$ kernel; the convolution can be expressed as in Equation (5), where ${\mathbf{X}}_{:,:,:}^{k}$ is a $C\times H\times W$ matrix that represents a spatially shifted version of the input matrix $\mathbf{X}$ with shift amount $({i}_{k},{j}_{k})$, ${\mathbf{W}}_{:,:,k}$ is the $D\times C$ matrix corresponding to kernel index $k$, and $\star$ denotes the convolution operator.

$$\mathbf{Y} = \mathbf{W} \star \mathbf{X} = \left[\mathbf{W}_{:,:,1}\;\mathbf{W}_{:,:,2}\;\cdots\;\mathbf{W}_{:,:,K}\right]\times\left[\begin{array}{c}\mathbf{X}^{1}_{:,:,:}\\ \mathbf{X}^{2}_{:,:,:}\\ \vdots\\ \mathbf{X}^{K}_{:,:,:}\end{array}\right] = \sum_{k=1}^{K}\mathbf{W}_{:,:,k}\times\mathbf{X}^{k}_{:,:,:} = \sum_{k=1}^{K}\mathbf{W}_{:,:,k}\times\mathcal{S}_{k}(\mathbf{X}),$$

In Equation (5), the standard convolution is simply a sum of $1\times 1$ convolutions on shifted inputs. The function ${\mathcal{S}}_{k}$ maps the input $\mathbf{X}$ to the corresponding shifted input ${\mathbf{X}}_{:,:,:}^{k}$; the standard convolution uses integer-valued shift amounts indexed by $k$. Figure 2 illustrates our reformulated $n\times n$ convolution. If we can share the shifted input regardless of the kernel index, that is, ${\mathcal{S}}_{k}\left(\mathbf{X}\right)=\mathcal{S}\left(\mathbf{X}\right)$, the standard convolution simplifies to a $1\times 1$ convolution, as shown in Equation (6). In this paper, we propose a shift function $\mathcal{S}$ that is invertible and has a tractable Jacobian determinant.

$$\begin{array}{cc}\hfill \sum _{k=1}^{K}{\mathbf{W}}_{:,:,k}\times {\mathcal{S}}_{k}\left(\mathbf{X}\right)& =\sum _{k=1}^{K}{\mathbf{W}}_{:,:,k}\times \mathcal{S}\left(\mathbf{X}\right)\hfill \\ \hfill \phantom{\rule{1.em}{0ex}}& =\left(\sum _{k=1}^{K}{\mathbf{W}}_{:,:,k}\right)\times \mathcal{S}\left(\mathbf{X}\right).\hfill \end{array}$$
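The equivalence between a standard $n\times n$ convolution and a sum of $1\times 1$ convolutions on shifted inputs (Equation (5)) can be checked numerically. The sketch below, with hypothetical sizes, compares a direct 3×3 convolution against the shifted-input formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, H, W, n = 2, 3, 5, 5, 3            # in/out channels, spatial size, kernel size
X = rng.normal(size=(C, H, W))
kern = rng.normal(size=(D, C, n, n))     # standard n x n kernel

# Direct n x n convolution ("same" padding, stride 1).
Xp = np.pad(X, ((0, 0), (1, 1), (1, 1)))
Y_direct = np.zeros((D, H, W))
for i in range(H):
    for j in range(W):
        patch = Xp[:, i:i + n, j:j + n]          # C x n x n receptive field
        Y_direct[:, i, j] = np.tensordot(kern, patch, axes=3)

# Same convolution as a sum of 1 x 1 convolutions over shifted inputs:
# Y = sum_k  W_{:,:,k} x S_k(X),  with S_k a spatial shift by (i_k, j_k).
Y_shift = np.zeros((D, H, W))
for di in range(n):
    for dj in range(n):
        shifted = Xp[:, di:di + H, dj:dj + W]    # S_k(X), shape C x H x W
        Wk = kern[:, :, di, dj]                  # D x C "1x1" kernel slice
        Y_shift += np.einsum('dc,chw->dhw', Wk, shifted)

assert np.allclose(Y_direct, Y_shift)
```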

In this section, we first introduce our proposed Invertible Shift Function and then present the invertible $n\times n$ convolution in detail.

The shift function $\mathcal{S}$ approximates all shifted inputs ${\mathbf{X}}_{:,:,:}^{k}$ ($1\le k\le K$). We propose to design $\mathcal{S}$ as a linear transformation per channel; specifically, the learnable variables ${\alpha}_{c},{\beta}_{c}$ ($1\le c\le C$) are the scale and translation parameters of each channel, respectively. The shift function $\mathcal{S}$ is formulated in Equation (7), where $c$ is the depth (channel) index and $i,j$ are spatial indices. The inverse of $\mathcal{S}$ is easy to obtain:

$$\mathcal{S}\left({\mathbf{X}}_{c,i,j}\right)={\alpha}_{c}{\mathbf{X}}_{c,i,j}+{\beta}_{c},$$

$${\mathbf{X}}_{c,i,j}=\frac{\mathcal{S}\left({\mathbf{X}}_{c,i,j}\right)-{\beta}_{c}}{{\alpha}_{c}}.$$

Thanks to Equation (7), the value of $\mathcal{S}\left({\mathbf{X}}_{c,i,j}\right)$ depends only on ${\mathbf{X}}_{c,i,j}$, and the Jacobian matrix is therefore diagonal:

$$\mathbf{J}=\frac{\partial \mathcal{S}(\mathbf{X})}{\partial \mathbf{X}} = \begin{bmatrix}\frac{\partial \mathcal{S}(\mathbf{X}_{1,1,1})}{\partial \mathbf{X}_{1,1,1}} & 0 & \cdots & 0\\ 0 & \frac{\partial \mathcal{S}(\mathbf{X}_{1,1,2})}{\partial \mathbf{X}_{1,1,2}} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \frac{\partial \mathcal{S}(\mathbf{X}_{C,H,W})}{\partial \mathbf{X}_{C,H,W}}\end{bmatrix} = \begin{bmatrix}\alpha_{1} & 0 & \cdots & 0\\ 0 & \alpha_{1} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \alpha_{C}\end{bmatrix}.$$

Therefore, the determinant of the Jacobian in Equation (9) is the product of the diagonal elements of the matrix $\mathbf{J}$, as in Equation (10).

$$\det\left(\frac{\partial \mathcal{S}(\mathbf{X})}{\partial \mathbf{X}}\right) = \prod_{c=1}^{C}\alpha_{c}^{H\times W}, \qquad \log\left|\det\left(\frac{\partial \mathcal{S}(\mathbf{X})}{\partial \mathbf{X}}\right)\right| = H\times W\times\sum_{c=1}^{C}\log\left|\alpha_{c}\right|.$$
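A minimal sketch of the shift function, its inverse (Equation (8)), and the log-determinant of Equation (10), using illustrative NumPy arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 3, 4, 4
X = rng.normal(size=(C, H, W))
alpha = rng.uniform(0.5, 1.5, size=C)   # must stay nonzero for invertibility
beta = rng.normal(size=C)

def shift(X, alpha, beta):
    # S(X)_{c,i,j} = alpha_c * X_{c,i,j} + beta_c   (Equation (7))
    return alpha[:, None, None] * X + beta[:, None, None]

def shift_inverse(Y, alpha, beta):
    # X_{c,i,j} = (S(X)_{c,i,j} - beta_c) / alpha_c   (Equation (8))
    return (Y - beta[:, None, None]) / alpha[:, None, None]

Y = shift(X, alpha, beta)
assert np.allclose(shift_inverse(Y, alpha, beta), X)

# log|det J| = H * W * sum_c log|alpha_c|   (Equation (10))
logdet = H * W * np.sum(np.log(np.abs(alpha)))

# Cross-check against the explicit diagonal Jacobian: each alpha_c appears
# H*W times on the diagonal, so the log-dets must match.
J_diag = np.repeat(np.abs(alpha), H * W)
assert np.allclose(logdet, np.sum(np.log(J_diag)))
```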

Kingma et al. [29] proposed the invertible $1\times 1$ convolution as a smart way to learn the permutation matrix instead of fixing it [31,32]. However, the $1\times 1$ convolution suffers from limited flexibility compared to the standard convolution. In particular, its receptive field is limited: even as the network goes deeper, the receptive field of stacked $1\times 1$ convolutions remains a small area, which therefore cannot generalize to or model large objects in high-dimensional data. Nevertheless, the $1\times 1$ convolution has its own advantages over the standard convolution. First, it allows the network to compress the input volume into a smaller representation. Second, it suffers less from over-fitting due to its small kernel size. Therefore, our proposal still takes advantage of the $1\times 1$ convolution; specifically, we adopt the successful invertible $1\times 1$ convolution of Glow [29] in our design.

In the previous subsection, we proved that the shift function $\mathcal{S}$ is invertible and that its Jacobian determinant is tractable. In Section 3.2, we showed that if we share shifted inputs regardless of the kernel index via the shift function $\mathcal{S}$, the standard $n\times n$ convolution simplifies to the composition of $\mathcal{S}$ and a $1\times 1$ convolution. Therefore, the invertible $n\times n$ convolution is equivalent to the combination of the invertible shift function $\mathcal{S}$ and the invertible $1\times 1$ convolution: the input is first passed through the shift function $\mathcal{S}$ and then convolved with the $1\times 1$ filter. Algorithm 1 gives the pseudo code of the invertible $n\times n$ convolution.

Algorithm 1: Invertible $n\times n$ Convolution

Input: An input $\mathbf{X}\in {\mathbb{R}}^{N\times H\times W\times C}$

Result: The output of the invertible $n\times n$ convolution and the log Jacobian determinant

Initialize $\alpha ,\beta \in {\mathbb{R}}^{C}$ for the invertible shift function;

Initialize $\mathbf{W}\in {\mathbb{R}}^{C\times C}$ as a rotation matrix for the invertible $1\times 1$ convolution;

logdet = 0.0;

// The invertible shift function

$\mathbf{Y}=\mathbf{X}\times \alpha +\beta$ (channel-wise operations);

The inverse is $\mathbf{X}=(\mathbf{Y}-\beta )/\alpha$;

logdet = logdet + $H\times W\times {\sum}_{i=1}^{C}\log \left|{\alpha}_{i}\right|$;

// The invertible $1\times 1$ convolution

$\mathbf{Z}=\mathrm{Conv}(\mathbf{Y},\mathbf{W})$;

The inverse is $\mathbf{Y}=\mathrm{Conv}(\mathbf{Z},{\mathbf{W}}^{-1})$;

logdet = logdet + $H\times W\times \log \left|\det \left(\mathbf{W}\right)\right|$;

Return $\mathbf{Z}$ and logdet;
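Algorithm 1 can be sketched as follows. This is an illustrative NumPy version for a single $C\times H\times W$ sample, with an orthogonal matrix standing in for the rotation-matrix initialization of the $1\times 1$ convolution:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 3, 4, 4
X = rng.normal(size=(C, H, W))

# Shift function parameters (the paper initializes alpha=1, beta=0, which
# gives the identity; random nonzero values are used here for illustration).
alpha = rng.uniform(0.5, 1.5, size=C)
beta = rng.normal(size=C)

# Invertible 1x1 convolution weight: an orthogonal (hence invertible) C x C matrix.
Wmat = np.linalg.qr(rng.normal(size=(C, C)))[0]

def forward(X):
    logdet = 0.0
    Y = alpha[:, None, None] * X + beta[:, None, None]      # shift function S
    logdet += H * W * np.sum(np.log(np.abs(alpha)))
    Z = np.einsum('dc,chw->dhw', Wmat, Y)                   # 1x1 convolution
    logdet += H * W * np.log(np.abs(np.linalg.det(Wmat)))
    return Z, logdet

def inverse(Z):
    Y = np.einsum('dc,chw->dhw', np.linalg.inv(Wmat), Z)    # undo 1x1 conv
    return (Y - beta[:, None, None]) / alpha[:, None, None] # undo shift

Z, logdet = forward(X)
assert np.allclose(inverse(Z), X)   # exact round trip, as required of a flow
```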

Figure 3a illustrates one step of our flow. We adopt the common design of a flow step [29,35,46]. Our proposal can be easily integrated into the multi-scale architecture designed by Dinh et al. [32] (Figure 3b). With our proposal, the invertible $1\times 1$ convolution is generalized to an invertible $n\times n$ convolution through the shift function $\mathcal{S}$, which encourages the filters to learn a more efficient data representation and embed more useful latent features than the invertible $1\times 1$ convolution used in Glow [29]. In addition, our approach uses fewer parameters and has lower inference time than standard $n\times n$ convolutions.

In this section, we present our experimental results on the CIFAR-10, ImageNet and Celeb-HQ datasets. First, in Section 5.1, we compare log-likelihoods against previous flow-based models, that is, RealNVP [32], Glow [29] and Emerging Convolutions [35]. Then, in Section 5.2, we show qualitative results trained on the Celeb-HQ dataset.

The shift function $\mathcal{S}$ is not invertible if ${\alpha}_{c}=0$ for some $c\in [1\dots C]$. Hence, in the training process, we first initialize ${\alpha}_{c}=1$ and ${\beta}_{c}=0$ ($1\le c\le C$). During learning, we keep every ${\alpha}_{c}$ ($1\le c\le C$) different from 0 to guarantee that the shift function $\mathcal{S}$ remains invertible and that its Jacobian determinant remains tractable. Training models on high-dimensional data requires large memory; to train with a large batch size, we trained the models simultaneously and distributively on four GPUs via the Horovod (https://github.com/horovod/horovod, accessed on 8 July 2021) and TensorFlow (https://tensorflow.org, accessed on 8 July 2021) frameworks.
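One standard way to keep every ${\alpha}_{c}$ away from zero (an illustrative assumption; the paper does not specify its exact scheme) is to learn an unconstrained parameter $s_c$ and set ${\alpha}_{c}=\exp(s_c)$, which is strictly positive for any real $s_c$:

```python
import numpy as np

# Hypothetical parameterization: learn s_c freely, derive alpha_c = exp(s_c).
s = np.array([-2.0, 0.0, 3.0])   # unconstrained learnable parameters
alpha = np.exp(s)                # strictly positive -> S stays invertible

assert np.all(alpha > 0)
# The log|alpha_c| term needed for the Jacobian (Equation (10)) is simply s_c:
assert np.allclose(np.log(np.abs(alpha)), s)
```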

The CelebA-HQ dataset [42] was selected to train the model using the architectures defined in the previous section at a higher resolution ($256\times 256$ images). The depth of flow K and the number of levels L were set to 32 and 6, respectively. Since high-dimensional data requires large memory, we reduced the batch size to 1 (per GPU) and trained on eight GPUs. The qualitative experiment aims to study the efficiency of the model when it scales up to high-resolution images, synthesizes realistic images, and provides a meaningful latent space. Figure 4c shows examples from the CelebA-HQ dataset. We trained our model on 5-bit images in order to improve visual quality with a slight trade-off in color fidelity. As shown by the synthesized images in Figure 5, our model can generate realistic images from high-dimensional data.

This paper has presented a novel invertible $n\times n$ convolution approach. By reformulating the convolution layer, we propose to use a shift function that shifts inputs instead of kernels. We prove that our shift function is invertible and that its Jacobian determinant is tractable. The method leverages the shift function and the invertible $1\times 1$ convolution to generalize to an invertible $n\times n$ convolution. Through experiments, our proposal has achieved state-of-the-art results in quantitative measurements and is able to generate realistic high-resolution images.

Several challenges remain to be addressed in future work. In particular, when the model scales up to high-resolution images, it requires a large amount of GPU memory during training, that is, during the back-propagation process. Maintaining the rotation-matrix property of the invertible $1\times 1$ convolution when training on a large dataset is also challenging, since the model easily falls into a non-invertible matrix due to the stochastic gradient updates of the back-propagation algorithm. These issues are interesting directions for future improvement.

Conceptualization: T.-D.T. and C.N.D. Methodology: T.-D.T., C.N.D. and K.L. Review and Editing: K.L., M.-T.T., and N.L. Supervision: K.L. and M.-T.T. All authors have read and agreed to the published version of the manuscript.

This research was funded by National Science Foundation (NSF).

Not applicable.

Not applicable.

CIFAR Dataset https://www.cs.toronto.edu/~kriz/cifar.html, accessed on 8 July 2021, ImageNet dataset https://image-net.org/, accessed on 8 July 2021, and CelebA-HQ Dataset https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html, accessed on 8 July 2021.

This work is partially supported by NSF EPSCoR Track-1 Data Science, Data Analytics that are Robust and Trusted (DART), NSF Track-2 CRESH, and NSF 19-554 Small Business Innovation Research Program. The authors would like to thank the reviewers for their valuable comments.

The authors declare no conflict of interest.

- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
- Sun, S.; Pang, J.; Shi, J.; Yi, S.; Ouyang, W. FishNet: A Versatile Backbone for Image, Region, and Pixel Level Prediction. Available online: https://arxiv.org/abs/1901.03495 (accessed on 8 July 2021).
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Lecture Notes in Computer Science Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Luu, K.; Seshadri, K.; Savvides, M.; Bui, T.; Suen, C. Contourlet Appearance Model for Facial Age Estimation. In Proceedings of the 2011 International Joint Conference on Biometrics (IJCB), Washington, DC, USA, 11–13 October 2011. [Google Scholar]
- Le, H.; Seshadri, K.; Luu, K.; Savvides, M. Facial Aging and Asymmetry Decomposition Based Approaches to Identification of Twins. Pattern Recognit. **2015**, 48, 3843–3856. [Google Scholar] [CrossRef]
- Xu, F.; Luu, K.; Savvides, M. Spartans: Single-sample Periocular-based Alignment-robust Recognition Technique Applied to Non-frontal Scenarios. IEEE Trans. Image Process. **2015**, 12, 4780–4795. [Google Scholar] [CrossRef]
- Xu, J.; Luu, K.; Savvides, M.; Bui, T.; Suen, C. Investigating Age Invariant Face Recognition Based on Periocular Biometrics. In Proceedings of the 2011 International Joint Conference on Biometrics (IJCB), Washington, DC, USA, 11–13 October 2011. [Google Scholar]
- Duong, C.; Quach, K.; Luu, K.; Le, H.K. Fine Tuning Age Estimation with Global and Local Facial Features. In Proceedings of the 36th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011. [Google Scholar]
- Luu, K.; Bui, T.K.; Suen, C. Age Estimation using Active Appearance Models and Support Vector Machine Regression. In Proceedings of the 2009 IEEE 3rd International Conference on Biometrics: Theory, Applications, and Systems, Washington, DC, USA, 28–30 September 2009. [Google Scholar]
- Luu, K.; Bui, T.; Suen, C. Kernel Spectral Regression of Perceived Age from Hybrid Facial Features. In Proceedings of the 2011 IEEE International Conference on Automatic Face and Gesture Recognition (FG), Santa Barbara, CA, USA, 21–25 March 2011. [Google Scholar]
- Chen, C.; Yang, W.; Wang, Y.; Ricanek, K.; Luu, K. Facial Feature Fusion and Model Selection for Age Estimation. In Proceedings of the 2011 IEEE International Conference on Automatic Face and Gesture Recognition (FG), Santa Barbara, CA, USA, 21–25 March 2011. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. **2017**, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Luu, K.K., Jr.; Bui, T.; Suen, C. The Familial Face Database: A Longitudinal Study of Family-based Growth and Development on Face Recognition. In Proceedings of the Robust Biometrics: Understanding Science and Technology, Marriott Waikiki, HI, USA, 2–5 November 2008. [Google Scholar]
- Luu, K. Computer Approaches for Face Aging Problems. In Proceedings of the 23th Canadian Conference On Artificial Intelligence (CAI), Ottawa, ON, Canada, 31 May–2 June 2010. [Google Scholar]
- Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv **2014**, arXiv:1411.1784. [Google Scholar]
- Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. arXiv **2017**, arXiv:1710.10196. [Google Scholar]
- Duong, C.; Luu, K.; Quach, K.; Bui, T. Longitudinal Face Modeling via Temporal Deep Restricted Boltzmann Machines. In Proceedings of the 2016 IEEE Conference On Computer Vision And Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Duong, C.; Quach, K.; Luu, K.; Le, T.; Savvides, M. Temporal Non-volume Preserving Approach to Facial Age-Progression and Age-Invariant Face Recognition. In Proceedings of the 2017 IEEE International Conference On Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Mattia, F.D.; Galeone, P.; Simoni, M.D.; Ghelfi, E. A Survey on GANs for Anomaly Detection. arXiv **2019**, arXiv:1906.11632. [Google Scholar]
- Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5505–5514. [Google Scholar]
- Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410. [Google Scholar]
- Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
- Duong, C.; Luu, K.; Quach, K.; Nguyen, N.; Patterson, E.; Bui, T.; Le, N. Automatic Face Aging in Videos via Deep Reinforcement Learning. In Proceedings of the 2019 IEEE/CVF Conference On Computer Vision And Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
- Duong, C.; Luu, K.; Quach, K.; Bui, T. Deep Appearance Models: A Deep Boltzmann Machine Approach for Face Modeling. Int. J. Comput. Vis. **2019**, 127, 437–455. [Google Scholar] [CrossRef]
- Kingma, D.P.; Dhariwal, P. Glow: Generative Flow with Invertible 1x1 Convolutions. In Advances in Neural Information Processing Systems 31; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 10215–10224. [Google Scholar]
- Kingma, D.P.; Salimans, T.; Jozefowicz, R.; Chen, X.; Sutskever, I.; Welling, M. Improved Variational Inference with Inverse Autoregressive Flow. In Advances in Neural Information Processing Systems 29; Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 4743–4751. [Google Scholar]
- Dinh, L.; Krueger, D.; Bengio, Y. NICE: Non-linear Independent Components Estimation. arXiv **2015**, arXiv:1410.8516. [Google Scholar]
- Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using Real NVP. In Proceedings of the 3rd International Conference on Learning Representations, ICLR, Toulon, France, 24–26 April 2017. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; pp. 2672–2680. [Google Scholar]
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Hoogeboom, E.; van den Berg, R.; Welling, M. Emerging Convolutions for Generative Normalizing Flows. arXiv **2019**, arXiv:1901.11137. [Google Scholar]
- Papamakarios, G.; Murray, I.; Pavlakou, T. Masked Autoregressive Flow for Density Estimation. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 2335–2344. [Google Scholar]
- Behrmann, J.; Grathwohl, W.; Chen, R.T.Q.; Duvenaud, D.; Jacobsen, J.H. Invertible Residual Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; PMLR: Long Beach, CA, USA, 2019; Volume 97, pp. 573–582. [Google Scholar]
- Kim, H.; Papamakarios, G.; Mnih, A. The Lipschitz Constant of Self-Attention. arXiv **2021**, arXiv:2006.04710. [Google Scholar]
- Chen, R.T.; Behrmann, J.; Duvenaud, D.; Jacobsen, J.H. Residual flows for invertible generative modeling. arXiv **2019**, arXiv:1906.02735. [Google Scholar]
- Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. 2009. Available online: http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 8 July 2021).
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. Proceedings of Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009. [Google Scholar]
- Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. Proceedings of International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Ho, J.; Chen, X.; Srinivas, A.; Duan, Y.; Abbeel, P. Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design. arXiv **2019**, arXiv:1902.00275. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
- Germain, M.; Gregor, K.; Murray, I.; Larochelle, H. MADE: Masked Autoencoder for Distribution Estimation. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; Bach, F., Blei, D., Eds.; PMLR: Lille, France, 2015; Volume 37, pp. 881–889. [Google Scholar]
- Truong, D.; Duong, C.N.; Luu, K.; Tran, M.; Le, N. Domain Generalization via Universal Non-volume Preserving Approach. In Proceedings of the 2020 17th Conference On Computer And Robot Vision (CRV), Ottawa, ON, Canada, 13–15 May 2020. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]

Description | Function | Reverse Function | Log-Determinant
---|---|---|---
ActNorm [29] | $\mathbf{y}=\mathbf{x}\odot \gamma +\beta$ | $\mathbf{x}=(\mathbf{y}-\beta )/\gamma$ | $\sum \log \left|\gamma \right|$
Affine Coupling [32] | $\mathbf{x}=[{\mathbf{x}}_{a},{\mathbf{x}}_{b}]$; ${\mathbf{y}}_{a}={\mathbf{x}}_{a}\odot s\left({\mathbf{x}}_{b}\right)+t\left({\mathbf{x}}_{b}\right)$; $\mathbf{y}=[{\mathbf{y}}_{a},{\mathbf{x}}_{b}]$ | $\mathbf{y}=[{\mathbf{y}}_{a},{\mathbf{y}}_{b}]$; ${\mathbf{x}}_{a}=[{\mathbf{y}}_{a}-t\left({\mathbf{y}}_{b}\right)]/s\left({\mathbf{y}}_{b}\right)$; $\mathbf{x}=[{\mathbf{x}}_{a},{\mathbf{y}}_{b}]$ | $\sum \log \left|s\left({\mathbf{x}}_{b}\right)\right|$
$1\times 1$ conv [29] | ${\mathbf{y}}_{:,i,j}=\mathbf{W}{\mathbf{x}}_{:,i,j}$ | ${\mathbf{x}}_{:,i,j}={\mathbf{W}}^{-1}{\mathbf{y}}_{:,i,j}$ | $h\cdot w\cdot \log |\det \mathbf{W}|$
Our Shift Function | ${\mathbf{y}}_{c,i,j}={\alpha}_{c}{\mathbf{x}}_{c,i,j}+{\beta}_{c}$ | ${\mathbf{x}}_{c,i,j}=[{\mathbf{y}}_{c,i,j}-{\beta}_{c}]/{\alpha}_{c}$ | $h\cdot w\cdot {\sum}_{c}\log \left|{\alpha}_{c}\right|$

Models | CIFAR-10 | ImageNet 32 | ImageNet 64
---|---|---|---
RealNVP | 3.49 | 4.28 | 3.98
Glow | 3.35 | 4.09 | 3.81
Emerging Conv | 3.34 | 4.09 | 3.81
Ours | 3.50 | 3.96 | 3.74

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).