# Fast Flow Reconstruction via Robust Invertible n × n Convolution


Computer Vision and Image Understanding Lab, Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR 72501, USA

Department of Computer Science and Software Engineering, Concordia University, Montreal, QC H3G 2V4, Canada

Faculty of Information Technology, University of Science, VNU-HCM, Ho Chi Minh 721337, Vietnam

Author to whom correspondence should be addressed.

Academic Editor: Massimo Cafaro

Received: 31 May 2021 / Revised: 29 June 2021 / Accepted: 6 July 2021 / Published: 8 July 2021

(This article belongs to the Collection Machine Learning Approaches for User Identity)

Flow-based generative models have recently become one of the most efficient approaches to modeling data generation. Indeed, they are constructed from a sequence of invertible and tractable transformations. Glow first introduced a simple type of generative flow using an invertible $1\times 1$ convolution. However, the $1\times 1$ convolution suffers from limited flexibility compared to standard convolutions. In this paper, we propose a novel invertible $n\times n$ convolution approach that overcomes the limitations of the invertible $1\times 1$ convolution. In addition, our proposed network is not only tractable and invertible but also uses fewer parameters than standard convolutions. Experiments on the CIFAR-10, ImageNet and Celeb-HQ datasets have shown that our invertible $n\times n$ convolution helps to significantly improve the performance of generative models.

Supervised deep learning models have recently achieved numerous breakthrough results in various applications, for example, Image Classification [1,2,3], Object Detection [4,5,6], Face Recognition [7,8,9,10,11,12,13,14], Image Segmentation [15,16] and Generative Models [17,18,19,20,21,22]. However, these methods usually require a huge amount of annotated data, which is highly expensive to obtain. To sidestep the need for large annotation efforts, generative models have become a feasible solution. The main objective of generative models is to learn the hidden dependencies that exist in realistic data so that they can extract meaningful features and variable interactions to synthesize new realistic samples without human supervision or labeling. Generative models can be used in numerous applications such as anomaly detection [23], image inpainting [24], data generation [20,25], super-resolution [26], face synthesis [22,27,28], and so forth. However, learning generative models is an extremely challenging process due to the high dimensionality of the data.

Two types of generative models have been extensively deployed in recent years: likelihood-based methods [29,30,31,32] and Generative Adversarial Networks (GANs) [33]. Likelihood-based methods have three main categories: autoregressive models [30], variational autoencoders (VAEs) [34], and flow-based models [29,31,32]. A flow-based generative model is constructed from a sequence of invertible and tractable transformations; since the model explicitly learns the data distribution, the loss function is simply the negative log-likelihood.

The flow-based model was first introduced in NICE [31] and later extended in RealNVP [32]. These methods introduce an affine coupling layer that is invertible and has a tractable Jacobian determinant. By the design of the coupling layers, at each stage only a subset of the data is transformed while the rest is kept fixed; they may therefore be limited in flexibility. To overcome this limitation, coupling layers are alternated with less complex transformations that operate on all dimensions of the data. In RealNVP [32], the authors use a fixed permutation based on checkerboard and channel-wise masks. Kingma et al. [29] simplify the architecture by replacing the reverse permutation on the channel ordering with an invertible $1\times 1$ convolution.

However, $1\times 1$ convolutions are not flexible enough in these scenarios, and computing the inverse of a standard $n\times n$ convolution is extremely hard and usually incurs high computational costs. Prior approaches design invertible $n\times n$ convolutions using emerging convolutions [35], periodic convolutions [35], autoregressive flows [36] or stochastic approximations [37,38,39]. In this paper, we propose an approach that generalizes the invertible $1\times 1$ convolution to the more general $n\times n$ form. First, we reformulate the standard convolution layer by shifting the inputs instead of the kernels. Then, we propose an invertible shift function with a tractable Jacobian determinant. Through experiments on the CIFAR-10 [40], ImageNet [41] and Celeb-HQ [42] datasets, we show that our proposal is effective and efficient for high-dimensional data. Figure 1 illustrates the advantages of our approach with high-resolution synthesized images.

- Firstly, by analyzing the standard convolution layer, we reformulate its equation into a form such that, rather than shifting the kernels during the convolution process, shifting the input provides equivalent results.
- Secondly, we propose a novel invertible shift function that mathematically helps to reduce the computational cost of the standard convolution while keeping the range of the receptive fields. The determinant of the Jacobian matrix produced by this shift function can be computed efficiently.
- Thirdly, evaluations on several datasets of both objects and faces have shown the generalization ability of the proposed $n\times n$ convolution built on our novel invertible shift function.

Generative models can be divided into two groups, that is, Generative Adversarial Networks and Flow-based Generative Models. In the first group, Generative Adversarial Networks [33] provide an appropriate solution to model data generation. The discriminative model learns to distinguish real data from fake samples produced by a generative model, and the two models are trained as if playing a mini-max game. Meanwhile, in the second group, Flow-based Generative Models [29,31,32] are constructed from a sequence of **invertible** and **tractable** transformations. Unlike GANs, the model explicitly learns the data distribution $p\left(\mathbf{x}\right)$, and therefore the loss function is efficiently formulated as the log-likelihood.

In this section, we discuss several types of flow-based layers that are commonly used in flow-based generative models. An overview of several invertible functions is provided in Table 1. In particular, all functions admit an easily obtained inverse and a tractable Jacobian determinant. The symbols $\odot, /$ denote element-wise multiplication and division. $h, w$ denote the height and width of the input/output. $c$ is the depth (channel) index and $i, j$ are spatial indices.

Let $\mathbf{x}$ be a high-dimensional vector with unknown true distribution $\mathbf{x}\sim {p}_{\mathcal{X}}\left(\mathbf{x}\right)$, $\mathbf{x}\in \mathcal{X}$. Given a simple prior probability distribution ${p}_{\mathcal{Z}}$ on a latent variable $\mathbf{z}\in \mathcal{Z}$ and a bijection $f:\mathcal{X}\to \mathcal{Z}$, the change-of-variables formula defines a model distribution on $\mathcal{X}$ as shown in Equation (1), where $\frac{\partial f\left(\mathbf{x}\right)}{\partial \mathbf{x}}$ is the Jacobian of $f$ at $\mathbf{x}$. The log-likelihood objective is then equivalent to minimizing Equation (2):

$$p_{\mathcal{X}}(\mathbf{x}) = p_{\mathcal{Z}}(\mathbf{z})\left|\det\left(\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right)\right|,$$

$$\mathcal{L}(\mathcal{X}) = -\sum_{\mathbf{x}\in\mathcal{X}} \log p_{\mathcal{X}}(\mathbf{x}) = -\sum_{\mathbf{x}\in\mathcal{X}}\left[\log p_{\mathcal{Z}}(\mathbf{z}) + \log\left|\det\left(\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right)\right|\right].$$

Since the data $\mathbf{x}$ are discrete, we add random uniform noise $u\sim \mathcal{U}(0,a)$, where $a$ is determined by the discretization level of the data, to make $\mathbf{x}$ continuous. The generative process is defined in Equation (3).
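This dequantization step can be sketched in a few lines (a minimal illustration, assuming 8-bit pixel data so that the discretization gap is $a=1$):

```python
import numpy as np

rng = np.random.default_rng(0)
x_discrete = rng.integers(0, 256, size=(4, 4)).astype(np.float64)

# 8-bit pixels take integer levels, so the discretization gap is a = 1.
a = 1.0
u = rng.uniform(0.0, a, size=x_discrete.shape)
x_continuous = x_discrete + u   # now lies in the support of a continuous density

# Each dequantized value stays inside the unit interval above its integer level.
assert np.all(x_continuous >= x_discrete)
assert np.all(x_continuous < x_discrete + a)
```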

$$\begin{array}{cc}\hfill \mathbf{z}& \sim {p}_{\mathcal{Z}}\left(\mathbf{z}\right)\hfill \\ \hfill \mathbf{x}& ={f}^{-1}\left(\mathbf{z}\right).\hfill \end{array}$$

The bijection function $f$ is constructed from a sequence of invertible transformations with tractable Jacobian determinants: $f={f}_{1}\circ {f}_{2}\circ \dots \circ {f}_{K}$, where $K$ is the number of transformations. Such a sequence of invertible transformations is also called a normalizing flow. Equation (2) can then be written as Equation (4),
where ${\mathbf{h}}_{k}={f}_{1}\circ {f}_{2}\circ \dots \circ {f}_{k}\left({\mathbf{h}}_{0}\right)$ with ${\mathbf{h}}_{0}=\mathbf{x}$.

$$\mathcal{L}(\mathcal{X}) = -\sum_{\mathbf{x}\in\mathcal{X}}\log p_{\mathcal{X}}(\mathbf{x}) = -\sum_{\mathbf{x}\in\mathcal{X}}\left[\log p_{\mathcal{Z}}(\mathbf{z}) + \sum_{k=1}^{K}\log\left|\det\left(\frac{\partial \mathbf{h}_{k}}{\partial \mathbf{h}_{k-1}}\right)\right|\right]$$
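The objective in Equation (4) can be sketched as follows, using two toy elementwise maps as stand-ins for actual flow layers (the functions `f1` and `f2` below are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

# Two toy invertible maps with tractable log|det J| (illustrative stand-ins
# for real flow layers such as coupling layers or 1x1 convolutions).
def f1(x):
    y = 2.0 * x + 1.0
    logdet = x.size * np.log(2.0)   # Jacobian is 2*I
    return y, logdet

def f2(x):
    y = 0.5 * x
    logdet = x.size * np.log(0.5)   # Jacobian is 0.5*I
    return y, logdet

def log_standard_normal(z):
    # log density of a standard normal prior p_Z
    return -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi))

def negative_log_likelihood(x):
    h, total_logdet = x, 0.0
    for f in (f1, f2):              # the composed flow f = f1 o f2 o ... o fK
        h, logdet = f(h)
        total_logdet += logdet      # log-dets of the chain simply add up
    return -(log_standard_normal(h) + total_logdet)

x = np.array([0.3, -1.2, 0.7])
nll = negative_log_likelihood(x)
```

Note how the log-determinants of the individual transformations simply accumulate, which is exactly the inner sum over $k$ in Equation (4).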

In this section, we revisit the standard $n\times n$ convolution. Let $\mathbf{X}$ be a $C\times H\times W$ input and $\mathbf{W}$ a $D\times C\times K$ kernel; the convolution can be expressed as in Equation (5), where ${\mathbf{X}}_{:,:,:}^{k}$ is a $C\times H\times W$ matrix that represents a spatially shifted version of the input matrix $\mathbf{X}$ with shift amount $({i}_{k},{j}_{k})$, ${\mathbf{W}}_{:,:,k}$ is the $D\times C$ matrix corresponding to kernel index $k$, and $\star$ denotes the convolution operator.

$$\mathbf{Y} = \mathbf{W} \star \mathbf{X} = \left[\mathbf{W}_{:,:,1}\;\mathbf{W}_{:,:,2}\;\cdots\;\mathbf{W}_{:,:,K}\right]\times\left[\begin{array}{c}\mathbf{X}^{1}_{:,:,:}\\ \mathbf{X}^{2}_{:,:,:}\\ \vdots\\ \mathbf{X}^{K}_{:,:,:}\end{array}\right] = \sum_{k=1}^{K}\mathbf{W}_{:,:,k}\times\mathbf{X}^{k}_{:,:,:} = \sum_{k=1}^{K}\mathbf{W}_{:,:,k}\times\mathcal{S}_{k}(\mathbf{X}),$$

In Equation (5), the standard convolution is simply a sum of $1\times 1$ convolutions on shifted inputs. The function ${\mathcal{S}}_{k}$ maps the input $\mathbf{X}$ to the corresponding shifted input ${\mathbf{X}}_{:,:,:}^{k}$; the standard convolution uses integer-valued shift amounts indexed by $k$. Figure 2 illustrates our reformulated $n\times n$ convolution. If we can share the shifted input regardless of the kernel index, that is, ${\mathcal{S}}_{k}\left(\mathbf{X}\right)=\mathcal{S}\left(\mathbf{X}\right)$, the standard convolution simplifies to a $1\times 1$ convolution, as shown in Equation (6). In this paper, we propose a shift function $\mathcal{S}$ that is invertible and has a tractable Jacobian determinant.

$$\begin{array}{cc}\hfill \sum _{k=1}^{K}{\mathbf{W}}_{:,:,k}\times {\mathcal{S}}_{k}\left(\mathbf{X}\right)& =\sum _{k=1}^{K}{\mathbf{W}}_{:,:,k}\times \mathcal{S}\left(\mathbf{X}\right)\hfill \\ \hfill \phantom{\rule{1.em}{0ex}}& =\left(\sum _{k=1}^{K}{\mathbf{W}}_{:,:,k}\right)\times \mathcal{S}\left(\mathbf{X}\right).\hfill \end{array}$$
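The equivalence between a standard $n\times n$ convolution and a sum of $1\times 1$ convolutions on shifted inputs (Equation (5)) can be checked numerically. The sketch below, with hypothetical sizes, compares a direct 3×3 convolution against the shifted-input formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, H, W, n = 2, 3, 5, 5, 3            # in/out channels, spatial size, kernel size
X = rng.normal(size=(C, H, W))
kern = rng.normal(size=(D, C, n, n))     # standard n x n kernel

# Direct n x n convolution ("same" padding, stride 1).
Xp = np.pad(X, ((0, 0), (1, 1), (1, 1)))
Y_direct = np.zeros((D, H, W))
for i in range(H):
    for j in range(W):
        patch = Xp[:, i:i + n, j:j + n]          # C x n x n receptive field
        Y_direct[:, i, j] = np.tensordot(kern, patch, axes=3)

# Same convolution as a sum of 1 x 1 convolutions over shifted inputs:
# Y = sum_k  W_{:,:,k} x S_k(X),  with S_k a spatial shift by (i_k, j_k).
Y_shift = np.zeros((D, H, W))
for di in range(n):
    for dj in range(n):
        shifted = Xp[:, di:di + H, dj:dj + W]    # S_k(X), shape C x H x W
        Wk = kern[:, :, di, dj]                  # D x C "1x1" kernel slice
        Y_shift += np.einsum('dc,chw->dhw', Wk, shifted)

assert np.allclose(Y_direct, Y_shift)
```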

In this section, we first introduce our proposed Invertible Shift Function and then present the invertible $n\times n$ convolution in detail.

The shift function $\mathcal{S}$ approximates all shifted inputs ${\mathbf{X}}_{:,:,:}^{k}$ ($1\le k\le K$). We propose to design $\mathcal{S}$ as a linear transformation per channel; specifically, the learnable variables ${\alpha}_{c},{\beta}_{c}$ ($1\le c\le C$) are the scale and translation parameters of each channel, respectively. The shift function $\mathcal{S}$ is formulated in Equation (7), where $c$ is the depth (channel) index and $i,j$ are spatial indices. The inverse of $\mathcal{S}$ is easy to obtain:

$$\mathcal{S}\left({\mathbf{X}}_{c,i,j}\right)={\alpha}_{c}{\mathbf{X}}_{c,i,j}+{\beta}_{c},$$

$${\mathbf{X}}_{c,i,j}=\frac{\mathcal{S}\left({\mathbf{X}}_{c,i,j}\right)-{\beta}_{c}}{{\alpha}_{c}}.$$

Thanks to Equation (7), the value of $\mathcal{S}\left({\mathbf{X}}_{c,i,j}\right)$ depends only on ${\mathbf{X}}_{c,i,j}$, and the Jacobian matrix is therefore diagonal:

$$\mathbf{J}=\frac{\partial \mathcal{S}(\mathbf{X})}{\partial \mathbf{X}} = \begin{bmatrix}\frac{\partial \mathcal{S}(\mathbf{X}_{1,1,1})}{\partial \mathbf{X}_{1,1,1}} & 0 & \cdots & 0\\ 0 & \frac{\partial \mathcal{S}(\mathbf{X}_{1,1,2})}{\partial \mathbf{X}_{1,1,2}} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \frac{\partial \mathcal{S}(\mathbf{X}_{C,H,W})}{\partial \mathbf{X}_{C,H,W}}\end{bmatrix} = \begin{bmatrix}\alpha_{1} & 0 & \cdots & 0\\ 0 & \alpha_{1} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \alpha_{C}\end{bmatrix}.$$

Therefore, the determinant of the Jacobian in Equation (9) is the product of the diagonal elements of the matrix $\mathbf{J}$, as in Equation (10).

$$\det\left(\frac{\partial \mathcal{S}(\mathbf{X})}{\partial \mathbf{X}}\right) = \prod_{c=1}^{C}\alpha_{c}^{H\times W}, \qquad \log\left|\det\left(\frac{\partial \mathcal{S}(\mathbf{X})}{\partial \mathbf{X}}\right)\right| = H\times W\times\sum_{c=1}^{C}\log\left|\alpha_{c}\right|.$$
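A minimal sketch of the shift function, its inverse (Equation (8)), and the log-determinant of Equation (10), using illustrative NumPy arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 3, 4, 4
X = rng.normal(size=(C, H, W))
alpha = rng.uniform(0.5, 1.5, size=C)   # must stay nonzero for invertibility
beta = rng.normal(size=C)

def shift(X, alpha, beta):
    # S(X)_{c,i,j} = alpha_c * X_{c,i,j} + beta_c   (Equation (7))
    return alpha[:, None, None] * X + beta[:, None, None]

def shift_inverse(Y, alpha, beta):
    # X_{c,i,j} = (S(X)_{c,i,j} - beta_c) / alpha_c   (Equation (8))
    return (Y - beta[:, None, None]) / alpha[:, None, None]

Y = shift(X, alpha, beta)
assert np.allclose(shift_inverse(Y, alpha, beta), X)

# log|det J| = H * W * sum_c log|alpha_c|   (Equation (10))
logdet = H * W * np.sum(np.log(np.abs(alpha)))

# Cross-check against the explicit diagonal Jacobian: each alpha_c appears
# H*W times on the diagonal, so the log-dets must match.
J_diag = np.repeat(np.abs(alpha), H * W)
assert np.allclose(logdet, np.sum(np.log(J_diag)))
```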

Kingma et al. [29] proposed the invertible $1\times 1$ convolution as a smart way to learn the permutation matrix instead of fixing it [31,32]. However, the $1\times 1$ convolution suffers from limited flexibility compared to the standard convolution. In particular, its receptive field is limited: even as the network goes deeper, the receptive field of stacked $1\times 1$ convolutions remains a small area, which therefore cannot generalize to or model large objects in high-dimensional data. Nevertheless, the $1\times 1$ convolution has its own advantages over the standard convolution. First, it allows the network to compress the input volume into a smaller representation. Second, it suffers less from over-fitting due to its small kernel size. Therefore, our proposal still takes advantage of the $1\times 1$ convolution; specifically, we adopt the successful invertible $1\times 1$ convolution of Glow [29] in our design.

In the previous subsection, we proved that the shift function $\mathcal{S}$ is invertible and that its Jacobian determinant is tractable. In Section 3.2, we showed that if we share shifted inputs regardless of the kernel index via the shift function $\mathcal{S}$, the standard $n\times n$ convolution simplifies to the composition of $\mathcal{S}$ and a $1\times 1$ convolution. Therefore, the invertible $n\times n$ convolution is equivalent to the combination of the invertible shift function $\mathcal{S}$ and the invertible $1\times 1$ convolution: the input is first passed through the shift function $\mathcal{S}$ and then convolved with the $1\times 1$ filter. Algorithm 1 gives the pseudo code of the invertible $n\times n$ convolution.

Algorithm 1: Invertible $n\times n$ Convolution

Input: An input $\mathbf{X}\in {\mathbb{R}}^{N\times H\times W\times C}$

Result: The output of the invertible $n\times n$ convolution and the log Jacobian determinant

Initialize $\alpha ,\beta \in {\mathbb{R}}^{C}$ for the invertible shift function;

Initialize $\mathbf{W}\in {\mathbb{R}}^{C\times C}$ as a rotation matrix for the invertible $1\times 1$ convolution;

logdet = 0.0;

// The invertible shift function

$\mathbf{Y}=\mathbf{X}\times \alpha +\beta$ (channel-wise operations);

The inverse is $\mathbf{X}=(\mathbf{Y}-\beta )/\alpha$;

logdet = logdet + $H\times W\times {\sum}_{i=1}^{C}\log \left|{\alpha}_{i}\right|$;

// The invertible $1\times 1$ convolution

$\mathbf{Z}=\mathrm{Conv}(\mathbf{Y},\mathbf{W})$;

The inverse is $\mathbf{Y}=\mathrm{Conv}(\mathbf{Z},{\mathbf{W}}^{-1})$;

logdet = logdet + $H\times W\times \log \left|\det \left(\mathbf{W}\right)\right|$;

Return $\mathbf{Z}$ and logdet;
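Algorithm 1 can be sketched as follows. This is an illustrative NumPy version for a single $C\times H\times W$ sample, with an orthogonal matrix standing in for the rotation-matrix initialization of the $1\times 1$ convolution:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 3, 4, 4
X = rng.normal(size=(C, H, W))

# Shift function parameters (the paper initializes alpha=1, beta=0, which
# gives the identity; random nonzero values are used here for illustration).
alpha = rng.uniform(0.5, 1.5, size=C)
beta = rng.normal(size=C)

# Invertible 1x1 convolution weight: an orthogonal (hence invertible) C x C matrix.
Wmat = np.linalg.qr(rng.normal(size=(C, C)))[0]

def forward(X):
    logdet = 0.0
    Y = alpha[:, None, None] * X + beta[:, None, None]      # shift function S
    logdet += H * W * np.sum(np.log(np.abs(alpha)))
    Z = np.einsum('dc,chw->dhw', Wmat, Y)                   # 1x1 convolution
    logdet += H * W * np.log(np.abs(np.linalg.det(Wmat)))
    return Z, logdet

def inverse(Z):
    Y = np.einsum('dc,chw->dhw', np.linalg.inv(Wmat), Z)    # undo 1x1 conv
    return (Y - beta[:, None, None]) / alpha[:, None, None] # undo shift

Z, logdet = forward(X)
assert np.allclose(inverse(Z), X)   # exact round trip, as required of a flow
```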

Figure 3a illustrates one step of our flow. We adopt the common design of a flow step [29,35,46]. Our proposal can be easily integrated into the multi-scale architecture designed by Dinh et al. [32] (Figure 3b). With our proposal, the invertible $1\times 1$ convolution is generalized to an invertible $n\times n$ convolution through the shift function $\mathcal{S}$, which encourages the filters to learn a more efficient data representation and embed more useful latent features than the invertible $1\times 1$ convolution used in Glow [29]. In addition, our approach uses fewer parameters and has lower inference time than standard $n\times n$ convolutions.

In this section, we present our experimental results on the CIFAR-10, ImageNet and Celeb-HQ datasets. First, in Section 5.1, we compare log-likelihoods against previous flow-based models, that is, RealNVP [32], Glow [29] and Emerging Convolutions [35]. Then, in Section 5.2, we show qualitative results trained on the Celeb-HQ dataset.

The shift function $\mathcal{S}$ is not invertible if ${\alpha}_{c}=0$ for some $c\in [1\dots C]$. Hence, in the training process, we first initialize ${\alpha}_{c}=1$ and ${\beta}_{c}=0$ ($1\le c\le C$). During learning, we keep every ${\alpha}_{c}$ ($1\le c\le C$) different from 0 to guarantee that the shift function $\mathcal{S}$ remains invertible and that its Jacobian determinant remains tractable. Training models on high-dimensional data requires large memory; to train with a large batch size, we trained the models simultaneously and distributively on four GPUs via the Horovod (https://github.com/horovod/horovod, accessed on 8 July 2021) and TensorFlow (https://tensorflow.org, accessed on 8 July 2021) frameworks.
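One standard way to keep every ${\alpha}_{c}$ away from zero (an illustrative assumption; the paper does not specify its exact scheme) is to learn an unconstrained parameter $s_c$ and set ${\alpha}_{c}=\exp(s_c)$, which is strictly positive for any real $s_c$:

```python
import numpy as np

# Hypothetical parameterization: learn s_c freely, derive alpha_c = exp(s_c).
s = np.array([-2.0, 0.0, 3.0])   # unconstrained learnable parameters
alpha = np.exp(s)                # strictly positive -> S stays invertible

assert np.all(alpha > 0)
# The log|alpha_c| term needed for the Jacobian (Equation (10)) is simply s_c:
assert np.allclose(np.log(np.abs(alpha)), s)
```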

The CelebA-HQ dataset [42] was selected to train the model using the architectures defined in the previous section at a higher resolution ($256\times 256$ images). The depth of flow K and the number of levels L were set to 32 and 6, respectively. Since high-dimensional data requires large memory, we reduced the batch size to 1 (per GPU) and trained on eight GPUs. The qualitative experiment aims to study the efficiency of the model when it scales up to high-resolution images, synthesizes realistic images, and provides a meaningful latent space. Figure 4c shows examples from the CelebA-HQ dataset. We trained our model on 5-bit images in order to improve visual quality with a slight trade-off in color fidelity. As shown by the synthesized images in Figure 5, our model can generate realistic images from high-dimensional data.

This paper has presented a novel invertible $n\times n$ convolution approach. By reformulating the convolution layer, we propose to use a shift function that shifts inputs instead of kernels. We prove that our shift function is invertible and that its Jacobian determinant is tractable. The method leverages the shift function and the invertible $1\times 1$ convolution to generalize to an invertible $n\times n$ convolution. Through experiments, our proposal has achieved state-of-the-art results in quantitative measurements and is able to generate realistic high-resolution images.

Several challenges remain to be addressed in future work. In particular, when the model scales up to high-resolution images, it requires a large amount of GPU memory during training, that is, during the back-propagation process. Maintaining the rotation-matrix property of the invertible $1\times 1$ convolution when training on a large dataset is also challenging, since the model easily falls into a non-invertible matrix due to the stochastic gradient updates of the back-propagation algorithm. These issues are interesting directions for future improvement.

Conceptualization: T.-D.T. and C.N.D. Methodology: T.-D.T., C.N.D. and K.L. Review and Editing: K.L., M.-T.T., and N.L. Supervision: K.L. and M.-T.T. All authors have read and agreed to the published version of the manuscript.

This research was funded by National Science Foundation (NSF).

Not applicable.

Not applicable.

CIFAR Dataset https://www.cs.toronto.edu/~kriz/cifar.html, accessed on 8 July 2021, ImageNet dataset https://image-net.org/, accessed on 8 July 2021, and CelebA-HQ Dataset https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html, accessed on 8 July 2021.

This work is partially supported by NSF EPSCoR Track-1 Data Science, Data Analytics that are Robust and Trusted (DART), NSF Track-2 CRESH, and NSF 19-554 Small Business Innovation Research Program. The authors would like to thank the reviewers for their valuable comments.

The authors declare no conflict of interest.

- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
- Sun, S.; Pang, J.; Shi, J.; Yi, S.; Ouyang, W. FishNet: A Versatile Backbone for Image, Region, and Pixel Level Prediction. Available online: https://arxiv.org/abs/1901.03495 (accessed on 8 July 2021).
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Lecture Notes in Computer Science Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Luu, K.; Seshadri, K.; Savvides, M.; Bui, T.; Suen, C. Contourlet Appearance Model for Facial Age Estimation. In Proceedings of the 2011 International Joint Conference on Biometrics (IJCB), Washington, DC, USA, 11–13 October 2011. [Google Scholar]
- Le, H.; Seshadri, K.; Luu, K.; Savvides, M. Facial Aging and Asymmetry Decomposition Based Approaches to Identification of Twins. Pattern Recognit. **2015**, 48, 3843–3856. [Google Scholar] [CrossRef]
- Xu, F.; Luu, K.; Savvides, M. Spartans: Single-sample Periocular-based Alignment-robust Recognition Technique Applied to Non-frontal Scenarios. IEEE Trans. Image Process. **2015**, 12, 4780–4795. [Google Scholar] [CrossRef]
- Xu, J.; Luu, K.; Savvides, M.; Bui, T.; Suen, C. Investigating Age Invariant Face Recognition Based on Periocular Biometrics. In Proceedings of the 2011 International Joint Conference on Biometrics (IJCB), Washington, DC, USA, 11–13 October 2011. [Google Scholar]
- Duong, C.; Quach, K.; Luu, K.; Le, H.K. Fine Tuning Age Estimation with Global and Local Facial Features. In Proceedings of the 36th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011. [Google Scholar]
- Luu, K.; Bui, T.K.; Suen, C. Age Estimation using Active Appearance Models and Support Vector Machine Regression. In Proceedings of the 2009 IEEE 3rd International Conference on Biometrics: Theory, Applications, and Systems, Washington, DC, USA, 28–30 September 2009. [Google Scholar]
- Luu, K.; Bui, T.; Suen, C. Kernel Spectral Regression of Perceived Age from Hybrid Facial Features. In Proceedings of the 2011 IEEE International Conference on Automatic Face and Gesture Recognition (FG), Santa Barbara, CA, USA, 21–25 March 2011. [Google Scholar]
- Chen, C.; Yang, W.; Wang, Y.; Ricanek, K.; Luu, K. Facial Feature Fusion and Model Selection for Age Estimation. In Proceedings of the 2011 IEEE International Conference on Automatic Face and Gesture Recognition (FG), Santa Barbara, CA, USA, 21–25 March 2011. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. **2017**, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Luu, K.K., Jr.; Bui, T.; Suen, C. The Familial Face Database: A Longitudinal Study of Family-based Growth and Development on Face Recognition. In Proceedings of the Robust Biometrics: Understanding Science and Technology, Marriott Waikiki, HI, USA, 2–5 November 2008. [Google Scholar]
- Luu, K. Computer Approaches for Face Aging Problems. In Proceedings of the 23th Canadian Conference On Artificial Intelligence (CAI), Ottawa, ON, Canada, 31 May–2 June 2010. [Google Scholar]
- Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv **2014**, arXiv:1411.1784. [Google Scholar]
- Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. arXiv **2017**, arXiv:1710.10196. [Google Scholar]
- Duong, C.; Luu, K.; Quach, K.; Bui, T. Longitudinal Face Modeling via Temporal Deep Restricted Boltzmann Machines. In Proceedings of the 2016 IEEE Conference On Computer Vision And Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Duong, C.; Quach, K.; Luu, K.; Le, T.; Savvides, M. Temporal Non-volume Preserving Approach to Facial Age-Progression and Age-Invariant Face Recognition. In Proceedings of the 2017 IEEE International Conference On Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Mattia, F.D.; Galeone, P.; Simoni, M.D.; Ghelfi, E. A Survey on GANs for Anomaly Detection. arXiv **2019**, arXiv:1906.11632. [Google Scholar]
- Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5505–5514. [Google Scholar]
- Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410. [Google Scholar]
- Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
- Duong, C.; Luu, K.; Quach, K.; Nguyen, N.; Patterson, E.; Bui, T.; Le, N. Automatic Face Aging in Videos via Deep Reinforcement Learning. In Proceedings of the 2019 IEEE/CVF Conference On Computer Vision And Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
- Duong, C.; Luu, K.; Quach, K.; Bui, T. Deep Appearance Models: A Deep Boltzmann Machine Approach for Face Modeling. Int. J. Comput. Vis. **2019**, 127, 437–455. [Google Scholar] [CrossRef]
- Kingma, D.P.; Dhariwal, P. Glow: Generative Flow with Invertible 1x1 Convolutions. In Advances in Neural Information Processing Systems 31; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 10215–10224. [Google Scholar]
- Kingma, D.P.; Salimans, T.; Jozefowicz, R.; Chen, X.; Sutskever, I.; Welling, M. Improved Variational Inference with Inverse Autoregressive Flow. In Advances in Neural Information Processing Systems 29; Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 4743–4751. [Google Scholar]
- Dinh, L.; Krueger, D.; Bengio, Y. NICE: Non-linear Independent Components Estimation. arXiv **2015**, arXiv:1410.8516. [Google Scholar]
- Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using Real NVP. In Proceedings of the 3rd International Conference on Learning Representations, ICLR, Toulon, France, 24–26 April 2017. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; pp. 2672–2680. [Google Scholar]
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Hoogeboom, E.; van den Berg, R.; Welling, M. Emerging Convolutions for Generative Normalizing Flows. arXiv **2019**, arXiv:1901.11137. [Google Scholar]
- Papamakarios, G.; Murray, I.; Pavlakou, T. Masked Autoregressive Flow for Density Estimation. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 2335–2344. [Google Scholar]
- Behrmann, J.; Grathwohl, W.; Chen, R.T.Q.; Duvenaud, D.; Jacobsen, J.H. Invertible Residual Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; PMLR: Long Beach, CA, USA, 2019; Volume 97, pp. 573–582. [Google Scholar]
- Kim, H.; Papamakarios, G.; Mnih, A. The Lipschitz Constant of Self-Attention. arXiv **2021**, arXiv:2006.04710. [Google Scholar]
- Chen, R.T.; Behrmann, J.; Duvenaud, D.; Jacobsen, J.H. Residual flows for invertible generative modeling. arXiv **2019**, arXiv:1906.02735. [Google Scholar]
- Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. 2009. Available online: http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 8 July 2021).
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. Proceedings of Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009. [Google Scholar]
- Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. Proceedings of International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Ho, J.; Chen, X.; Srinivas, A.; Duan, Y.; Abbeel, P. Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design. arXiv **2019**, arXiv:1902.00275. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
- Germain, M.; Gregor, K.; Murray, I.; Larochelle, H. MADE: Masked Autoencoder for Distribution Estimation. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; Bach, F., Blei, D., Eds.; PMLR: Lille, France, 2015; Volume 37, pp. 881–889. [Google Scholar]
- Truong, D.; Duong, C.N.; Luu, K.; Tran, M.; Le, N. Domain Generalization via Universal Non-volume Preserving Approach. In Proceedings of the 2020 17th Conference On Computer And Robot Vision (CRV), Ottawa, ON, Canada, 13–15 May 2020. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]

Description | Function | Reverse Function | Log-Determinant
---|---|---|---
ActNorm [29] | $\mathbf{y}=\mathbf{x}\odot \gamma +\beta$ | $\mathbf{x}=(\mathbf{y}-\beta )/\gamma$ | $\sum \log \left|\gamma \right|$
Affine Coupling [32] | $\mathbf{x}=[{\mathbf{x}}_{a},{\mathbf{x}}_{b}]$; ${\mathbf{y}}_{a}={\mathbf{x}}_{a}\odot s\left({\mathbf{x}}_{b}\right)+t\left({\mathbf{x}}_{b}\right)$; $\mathbf{y}=[{\mathbf{y}}_{a},{\mathbf{x}}_{b}]$ | $\mathbf{y}=[{\mathbf{y}}_{a},{\mathbf{y}}_{b}]$; ${\mathbf{x}}_{a}=[{\mathbf{y}}_{a}-t\left({\mathbf{y}}_{b}\right)]/s\left({\mathbf{y}}_{b}\right)$; $\mathbf{x}=[{\mathbf{x}}_{a},{\mathbf{y}}_{b}]$ | $\sum \log \left|s\left({\mathbf{x}}_{b}\right)\right|$
$1\times 1$ conv [29] | ${\mathbf{y}}_{:,i,j}=\mathbf{W}{\mathbf{x}}_{:,i,j}$ | ${\mathbf{x}}_{:,i,j}={\mathbf{W}}^{-1}{\mathbf{y}}_{:,i,j}$ | $h\cdot w\cdot \log |\det \mathbf{W}|$
Our Shift Function | ${\mathbf{y}}_{c,i,j}={\alpha}_{c}{\mathbf{x}}_{c,i,j}+{\beta}_{c}$ | ${\mathbf{x}}_{c,i,j}=[{\mathbf{y}}_{c,i,j}-{\beta}_{c}]/{\alpha}_{c}$ | $h\cdot w\cdot {\sum}_{c}\log \left|{\alpha}_{c}\right|$

Models | CIFAR-10 | ImageNet 32 | ImageNet 64
---|---|---|---
RealNVP | 3.49 | 4.28 | 3.98
Glow | 3.35 | 4.09 | 3.81
Emerging Conv | 3.34 | 4.09 | 3.81
Ours | 3.50 | 3.96 | 3.74

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).