Multiple Network Fusion with Low-Rank Representation for Image-Based Age Estimation

Image-based age estimation is a challenging task since there are ambiguities between the apparent age of face images and the actual ages of people. Therefore, data-driven methods are popular. To improve data utilization and estimation performance, we propose an image-based age estimation method. Theoretically speaking, the key idea of the proposed method is to integrate multi-modal features of face images. In order to achieve it, we propose a multi-modal learning framework, which is called Multiple Network Fusion with Low-Rank Representation (MNF-LRR). In this process, different deep neural network (DNN) structures, such as autoencoders, Convolutional Neural Networks (CNNs), Recursive Neural Networks (RNNs), and so on, can be used to extract semantic information of facial images. The outputs of these neural networks are then represented in a low-rank feature space. In this way, feature fusion is obtained in this space, and robust multi-modal image features can be computed. An experimental evaluation is conducted on two challenging face datasets for image-based age estimation extracted from the Internet Move Database (IMDB) and Wikipedia (WIKI). The results show the effectiveness of the proposed MNF-LRR.


Introduction
Image-based age estimation tries to compute the age or age group with facial images.It can be widely used in many applications such as biometric feature recognition, human-computer interaction (HCI), and so on.Although a number of studies have been conducted [1][2][3], image-based age estimation is still a challenging task due to the following aspects.First, it often lacks sufficient training samples since each person may be captured by several images in a wide range of ages.Second, facial appearance may not indicate the age accurately since some people may look younger than they actually are and some people may look older.Third, facial images are often captured in wild conditions so they are influenced by large variations such as occlusion, lighting, shadow, and complex backgrounds.
Similar to many other applications of computer vision, most existing image-based age estimation approaches focus on two key stages: feature description and feature mapping.Feature description tries to represent facial images without losing details.Traditional methods usually uses texture features or shape features, such as the active appearance model (AAM) [4], holistic subspace features [5,6], local binary patterns (LBPs) [7], Gabor wavelets [4], bio-inspired features (BIFs) [8], and so on.However, most of them make use of hand-crafted features.In this way, strong prior knowledge is required.To solve this problem, learning-based feature descriptors [9,10] have been proposed to compute descriptive features directly from images.Recently, neural networks have been efficient in exploring descriptive representations in natural images, such as autoencoders [11], Convolutional Neural Networks (CNNs) [12], and so on.Among these methods, Liu et al. proposed group-aware deep feature learning (GA-DFL) to estimate ages with facial images [13].Different from most previous methods using hand-crafted features for facial image description, GA-DFL uses a deep CNN framework to compute a discriminative feature descriptor per image automatically from raw pixels of facial images.Although a large number of feature descriptors have been proposed, most of them can only describe a part of the information inherent in images.Therefore, researchers look into representing images with multiple features.Traditional methods make use of multiple features by directly concatenating them, which is oversimplified.To solve this problem, researchers also apply manifold learning to combine different types of features [14,15].
On the other hand, feature mapping tries to learn the mapping relationship from face images to age labels.With descriptive representations of facial images, age estimation is usually considered as a regression or classification problem [5,16].Linear regression and twin Gaussian processes are also used for pose estimation [17,18].Tian et al. proposed conducting age estimation by taking both ordinality and locality into consideration [19].Previous approaches made an over-simplified assumption that the mapping from images into poses is linear.To tackle this nonlinear issue, methods based on deep learning have been applied.They can train a series of nonlinear mapping models [20][21][22][23][24].However, these models cannot explicitly define the ordinal relationship between facial images and chronological ages, because they usually suffer from insufficient and unbalanced training data.In this way, they still cannot be used in practical scenarios.
Although many methods for age estimation with images have been proposed, they usually use only a single type of feature.Even with popular neural networks, they apply only a single structure of neural networks, which still suffers the so-called "semantic gap".Currently, multiple types of features have been used in many applications.Inspired by it, we proposed a Multiple Network Fusion with Low-Rank Representation (MNF-LRR) for the age estimation method.The contributions of this paper can be summarized by the following:

•
The first and key contribution is a novel framework that estimates ages with a single image by fusing multiple deep neural networks.This framework is flexible and the hidden representations are computed independently.In this way, different types of neural networks, different network structures, and different features can be used in this framework.

•
The second contribution of the proposed method is multiple-network fusion with low-rank learning.Low-rank representation is naturally sparse.Besides, different types of features are extracted by different networks and their distributions can be observed clearly.To improve traditional low-rank learning, we introduce a hypergraph manifold.In this way, samples can be represented in a unified low-rank space and the process of fusion can be achieved in this space.

•
The third contribution is that the performance of the proposed method is verified on datasets from the Internet Movie Database (IMDB) and Wikipedia (WIKI).They are challenging datasets since the images are collected in natural scenarios and not all of the faces are frontal.The performance on this dataset indicates that the proposed MNF-LRR is suitable for practical and complicated applications.

Overview of the Proposed Method
The process of the proposed method (MNF-LRR) can be summarized in Figure 1.To get rid of the influences of background, we should extract faces in images first.This process depends on the definitions of different datasets.In some datasets, such as IMDB and WIKI, the positions and sizes are provided and they can be used directly.However, in some other datasets or real scenarios, we need face detection or face tracking to determine the face area.We then utilize different networks to extract deep features of facial images.Finally, we use manifold learning based on low-rank representation to integrate the outputs of these networks.In this way, a unified multi-modal representation can be obtained.

Definitions
In age estimation with regression, given a set of images X = {x 1 , x 2 , ..., x n } and the corresponding labels Y = {y 1 , y 2 , ..., y n } with n pairs of samples, we try to learn a model that minimizes the loss: where F is the regression function, δ is the regression parameter, and X is the feature representation of X.Therefore, to minimize Equation (1), we need a descriptive X and a reasonable F .In the proposed method, we focus on X.

Multiple Network Learning
As mentioned in the introduction, multiple feature fusion has been proved to be effective in image representation.In this way, to compute X, we propose feature learning by fusing multiple neural networks, which compute features with different neural networks and integrate them to form new features.Neural networks [25,26] have been widely used to explore hidden representations of images and the effectiveness has been proved.Generally speaking, neural networks compute hidden representation by minimizing the loss function: where x i = Wx i is the hidden representation by mapping x i with weight W. The key to neural networks is optimizing W, which is defined differently by different neural networks.However, they depend on a large number of training data.Usually, age estimation with a single image is achieved with insufficient training samples or classification information.Therefore, we adopt different types of neural networks to extract different types of features and fuse them to improve the descriptive power with a small number of training samples.In MNF-LRR, we use the following neural networks to represent face images.

•
Autoencoders (AE).Autoencoders are unsupervised to learn the hidden representation.To solve Equation ( 2), people usually use denoising autoencoders (DAE).In DAE, inputs x 1 , ..., x n are corrupted by randomly removing some features.After corruption, x i is converted to xi and W : R d → R d is denoted as the transform matrix to reconstruct x i with xi .In this way, the squared reconstruction loss can be defined as The solution to Equation (3) depends on corrupted features of each input.To lower the variance, Marginal Denoising Autoencoders (MDA) [27] utilize multiple epochs with the training set, each epoch with different corruption settings.In this way, the overall squared loss can be transformed to where xi,j represents the jth corrupted features, and m is the number of epochs.
To represent features with the matrix form, X = [x 1 , ..., x n ] ∈ R d×n is denoted as the data matrix, while the m-epochs repeated version of X is denoted by X = [X, .., , X] and the corrupted version of X is denoted by X. Equation ( 4) can then be reduced to We can clearly figure out that Equation ( 5) is a convex problem, and the global optimal solution to it can be computed by setting its partial derivative for W to 0. We then need to compute partial derivative of loss(W), which is defined as

∂loss(W) ∂W
= 0 is set, and the close form to compute optimal W is • Convolutional Neural Networks (CNNs).CNNs are constructed by alternatively stacking convolutional layers and spatial pooling layers.Convolutional layers are key to CNNs since they generate feature maps by linear convolutional filters.The feature maps are then activated by nonlinear functions, which are called activation functions.Different activation functions are defined, such as rectifier, sigmoid, tanh, and so on.Taking the Rectified Linear Units (ReLUs) as an example, the feature maps can be computed by In computational networks, given an input or set of inputs, the activation function of a neuron defines the output of that neuron with these inputs.In the scenario of the deep neural network, activation functions project x v i to a higher level hidden representation step by step with a sequence of non-linear mappings, which can be defined as where l is the number of layers, and R is the mapping function from input to estimated output.
To optimize the weighted matrix W, which contains the mapping parameters, we use a back-propagation strategy.For each echo of this process, the weighted matrix is updated by ∆W, which is defined by η is the learning rate, and we can define In this way, we try to train a model that minimizes the differences between the groundtruth y i and the estimated output R(x i ).The back-propagation strategy can be modeled by • Recursive Neural Networks (RNN).RNNs process a structured input with the same set of weights recursively.In this way, we can traverse the given structure into topological order and obtain a structured output or a scalar prediction on it.Different from CNNs, nodes in RNNs are integrated into parents with a weight matrix.This matrix is shared across the whole network.Besides, a non-linearity such as activation functions mentioned above is used.Taking tanh as an example, if x i and x j are n-dimensional features of nodes, their parent must be an n-dimensional feature, too.It can be computed by where W is a learned n × 2n weight matrix, which is usually computed with Stochastic Gradient Descent (SGD).The gradients are calculated using back-propagation through structure (BPTS).BPTS is a variant of the aforementioned back-propagation through time for RNNs.
In the proposed MNF-LRR, by combining Equations ( 7), (11), and (13), Equation (2) can be rewritten as where W ae , W cnn , and W rnn are weighting parameters learned by autoencoders, CNNs, and RNNs, respectively.α, β, and γ are switches to turn on or off the corresponding neural networks.In this way, we can compute multi-modal feature representation.

Fusion with Low-Rank Representation
As mentioned before, multi-modal feature fusion by computing semantic relationship is more reasonable than simple concatenation.The key to learning the semantic relationship is how to define and compute affinities among data.Many existing methods can be used, such as subspace learning and manifold learning.Recently, low-rank learning attracts plenty of attention.In low-rank representation, assume the data is clean and is drawn from independent subspaces, then there exists Q * , which is block-diagonal, and the rank of each block equals the dimension of the corresponding subspace.Given the i-th modal X (i) , computed in the previous sub-section, we can compute the affinities among feature vectors by solving the minimization problem: where • * denotes the trace norm, and • 2,1 is the 2,1 -norm to characterize noise.λ > 0 is the parameter to balance the influences of the two parts.The optimal solution to Equation (15), which is denoted as Q * 0 , naturally defines an affinity relationship that implies the pairwise similarities between features.In this way, the similarity S (i) l can be computed by where (•) lk is the (l, k)-th element of a matrix.In previous methods, the above low-rank learning process can be used with only a single type of feature vectors.Therefore, we extend it to the multi-modal scenario and apply it to feature fusion.We define the multi-modal low-rank learning by min where α > 0 is a balanced parameter and m is the number of modals.In this way, we can infer a set of matrices Q (1) ,Q (2) , ..., Q (m)) .In this set, each n × n matrix Q (i) corresponds to the i-th modal X (i) .
The global solution defined by m × n 2 matrix Q is constructed by arranging Q 1 ,Q 2 , ..., Q m as follows: * is defined as the optimal solution to Equation (18).A universal affinity matrix can then obtained by quantifying the columns of Q: In manifold learning, the key to solving the manifold is computing the affinity matrix.Therefore, with the affinity matrix computed by Equation ( 19), we can construct the manifold in low-rank space to obtain fused feature descriptors.Specifically, we use affinity matrix Q to construct Laplacian matrix L. There are a number of solutions to this problem.In the proposed method, we follow the spectral hypergraph clustering method [28] and use it in the low-rank space.In this way, we consider each feature vector as a vertex v in the low-rank feature space.If some vertices share the same property, they are connected by a hyperedge e.In this method, L is defined as where I denotes an n × n identity matrix.Combined with LRR, C in our proposed framework is defined by In this equation, U is the matrix that indicates when a vertex belongs to a hyperedge if U i,j = 1.D e and D v are diagonal matrices, which contain degrees of hyperedge e and degrees of vertex v, respectively.Degrees of hyperedge are defined as the numbers of vertices that hyperedges connect, while degrees of vertex are defined as the sum of hyperedge weights connected to this vertex.
To compute U, we define that the vertices within a certain distance σ from a vertex form a hyperedge with this vertex.Therefore, U can be computed by With U (i) for the i-th modal, we use logistic OR to compute a unified U for all modals: Then, we can compute D e directly with U by summing each row: D v can be computed with Q by summing the items within d: With L, we apply the standard eigen-decomposition and obtain the eigenvectors corresponding to the d smallest eigenvalues.Finally, we can obtain the multi-modal features X with d × n dimensions.

Implementation of Age Estimation
In our implementation of age estimation, we use autoencoders, a CNN, and an RNN to extract the hidden representations.Among the activation functions, we use Rectified Linear Units (ReLUs) since ReLUs are inherently sparse and pretraining can be avoided.Then, their low-rank representations are computed and fusion is embedded in the low-rank space.With the unified affinity matrix, we compute L and use eigen-decomposition to obtain the fused features.Finally, the results are computed by softmax regression, which is mentioned as F in Equation ( 2).The developed system is implemented based on DeepLearnToolbox, which contains autoencoders and the CNN [29].We then add the RNN and low-rank learning to it.The settings of three neural networks are shown in Table 1.To make it possible to solve practical issues, TensorFlow version is under construction.

Settings and Datasets
Images in traditional face datasets are low-resolution and ages are not labeled.Therefore, Rothe et al. collected a large dataset of face images with age information [30], which has been made available for academic research purposes (Available at https://data.vision.ee.ethz.ch/cvl/rrothe/imdbwiki/(accessed on 8 May 2012)).To achieve this, they crawled the data of popular actors on the IMDb website and fetch their date of birth, name, and gender in their profiles.Besides, they crawled the same meta information and profile images of these people on the Wikipedia website.In this way, 460,723 face images were collected from 20,284 celebrities on IMDb, and 62,328 on Wikipedia were obtained.Sample images of these two datasets are shown in Figure 2. In our experiments, we used the two datasets individually.When we used IMDB, we randomly chose 100,000 images as the training samples and the rest as testing samples.When we used WIKI, we randomly chose 10,000 images as the training samples and the rest as testing samples.This process was repeated 20 times.Then, average performance and standard deviation were recorded.The evaluation was conducted on a desktop with NVIDIA 1080Ti.For evaluation, we used mean absolute errors (MAEs), which was computed by where Ŷ is the estimation results.For regression methods, results can be directly computed.
For classification methods, we simply treated the age estimation problem as a classification task of 100 classes.Y is the ground truth.

Optimization of Settings
As mentioned before, activation functions may influence the performance of neural networks.They define the mapped output of a node in different ways.In this way, different activation functions may influence the performance.Therefore, we have tried ReLUs, Sigmoid, and Tanh.The results of these three datasets are shown in Figure 3.Among the three activation functions, ReLUs achieved the best performance among all datasets, which matches recent publications.In the proposed framework, autoencoders, CNNs, and RNNs can be used.Different combinations of them are tested and results are shown in Figure 4.When we integrate the outputs of all neural networks, the performance is the best, which indicates the effectiveness of combining different neural networks.

Comparison of Multi-Modal Fusion Methods
We used different methods to integrate outputs of multiple neural networks.The following methods were used:

•
Low-Rank Representation (LRR): The proposed method using multiple neural network fusion and low-rank representation.

•
Concatenating Different Features (CON): For CON, features from different modals are simply concatenated to construct long features.Principle Component Analysis [31] is then used for dimensionality reduction.

•
Multiview Spectral Embedding (MSE) [32]: This method calculates a low-dimensional embedding.In this embedding, the distribution of each modal is sufficiently smooth.The complementary properties of different modals are then explored to obtain a fused representation.

•
Multi-View Hypergraph Learning (MHL) [33]: In this method, hypergraph learning is combined with the patch alignment framwork [34].A multi-view hypergraph Laplacian matrix is constructed, and fused features are computed by solving the standardeigen-decomposition of the multi-view hypergraph Laplacian matrix.
We computed the performance under different dimensions, and the results of different multi-modal fusion methods are shown in Figure 5.According to the figures, all these methods achieved optimal performance among [400, 600].However, optimal performance was not achieved on the same dimensionality.The proposed method uses LRR, and the best performance was achieved on 400 on IMDB and 500 on WIKI.Therefore, these settings were used in the other experiments.

Comparison of Different Methods for Age Estimation
For age estimation, we compared the following methods, including the proposed Multiple Network Fusion with Low-Rank Representation (MNF-LRR):

•
Multiple Network Fusion with Low-Rank Representation (MNF-LRR): The proposed method using multiple neural network fusion and low-rank representation.

•
Linear Regression (LR) [17]: This method estimates ages directly by linear regression against feature vectors of facial images.In this paper, HOG [35] was used as image features.
Ridge regression (RR-LR) and relevance vector machine (RVM-LR) regression were both implemented by the authors.Their results were similar.We used RVM-LR and set ν = 1000 in the experimental comparison.

•
Twin Gaussian Processes (TGPs) [18]: This method applies Gaussian process priors on both covariates and responses.Two Gaussian processes are then modeled as normal distributions over finite index sets of training and testing examples.Finally, outputs can be estimated by minimizing the Kullback-Leibler (K-L) divergence between them.The authors have provided several different implementations of TGP, such as Twin Gaussian Processes with K Nearest Neighbors (TGPKNN), Weighted K-Nearest Neighbor Regression (WKNNRegressor), Gaussian Process Regression (GPR), Hilbert-Schmidt Independent Criterion with K Nearest Neighbors (HSICKNN) and Kernel Target Alignment with K Nearest Neighbors (KTAKNN).We found that TGPKNN outperformed all other methods.Therefore, we used HOG for image features and TGPKNN as the regressor.

•
Convolutional Neural Networks (CNNs) [36].This method uses a simple convolutional net architecture.The network is composed of three convolutional layers and two fully connected layers.The authors have provided the Caffe model for age classification and deployed prototext.

•
Deep Expectation (DEX) [30].The authors here treated age estimation as a classification problem based on deep learning, which was followed by an expected value refinement with softmax.
The key to DEX for age regression contains deep learning models with a large amount of data, a robust face alignment process, and softmax-based expected value formulation.
The results of the experimental comparison are shown in Figure 6.Based on the results, we can make the following summarizations: 1.
The performance of general mapping learning methods such as LR and TGP is not satisfactory.They are fast and use traditional features such as HOG, but the definition of mapping relationship is oversimplified.

Discussion
According to the methodology of the developed system and the improvement of experimental performance, the novelty of the proposed method can be shown.
First, the developed system with Multiple Network Fusion with Low-Rank Representation tackles the problem of insufficient descriptive power with insufficient training data.To solve this problem, two types of solutions have been considered in previous methods.One is to improve the descriptive power of a single feature and the other is to fuse multiple features.To improve the descriptive power of a single feature, deep learning has been proved to be effective to represent images in the past few years.To fuse multiple features, manifold learning, low-rank learning, and so on are proposed.However, there have been few attempts to combine the above two solutions.We successfully combine deep learning and low-rank learning.In addition, we implement age estimation.Therefore, the proposed method is theoretically novel.
Second, experimental performance indicates the effectiveness of the proposed method, which can be summarized as follows: 1.
We compared different activation functions and different combinations of neural networks to determine the optimal neural network.2.
We compared different feature fusion methods to emphasize the effectiveness of choosing low-rank learning.

3.
We compared the proposed method with the state of the art in terms of age estimation to emphasize the overall improvement of the proposed method.
It can be concluded that the proposed method improves age estimation performance.

Conclusions
In this paper, we propose a data-driven method for image-based age estimation.Multiple Network Fusion with Low-Rank Representation (MNF-LRR) is designed to learn and integrate multi-modal features.First, multi-modal features extracted with different neural networks.Second, these features are represented on low-rank space and fused.In this way, a robust representation of facial images for age estimation is computed.In addition, the fused features are connected to softmax, and the estimation results can be obtained.Compared with state of the art, the proposed method is based on multiple features and utilizes multiple neural networks to compute features, improves the descriptive power of representations.We have conducted experimental evaluation on datasets from the Internet Movie Database (IMDB) and Wikipedia (WIKI).Performance comparison indicates the superiority of the proposed MNF-LRR over previous methods.

Figure 1 .
Figure 1.The flowchart of the proposed method (Multiple Network Fusion with Low-Rank Representation (MNF-LRR)).

Figure 2 .
Figure 2. Sample images from datasets from the Internet Movie Database (IMDB) and Wikipedia (WIKI).(a) IMDB; (b) WIKI.

Figure 3 .
Figure 3. Different activation functions.(a) Results of IMDB; (b) Results of WIKI.

Figure 4 .
Figure 4. Different combinations of networks.(a) Results of IMDB; (b) Results of WIKI.

Figure 5 .
Figure 5. feature fusion methods.(a) Results of IMDB; (b) Results of WIKI.

Table 1 .
The structures of different networks implemented by the proposed method.