Multimodal Low Resolution Face and Frontal Gait Recognition from Surveillance Video

Biometric identification using surveillance video has attracted the attention of many researchers, as it is applicable not only to robust identification but also to personalized activity monitoring. In this paper, we present a novel multimodal recognition system that extracts frontal gait and low-resolution face images from frontal walking surveillance video clips to perform efficient biometric recognition. The proposed study addresses two important issues in surveillance video that did not receive appropriate attention in the past. First, it consolidates the model-free and model-based gait feature extraction approaches to perform robust gait recognition using only the frontal view. Second, it uses a low-resolution face recognition approach that can be trained and tested using only low-resolution face information. This eliminates the need for high-resolution face images to create the gallery, which is required by the majority of low-resolution face recognition techniques. Moreover, the resulting classification accuracy is considerably higher. Previous studies on frontal gait recognition incorporate assumptions to approximate the average gait cycle; in contrast, we quantify the gait cycle precisely for each subject using only the frontal gait information. The approaches available in the literature use high-resolution images obtained in a controlled environment to train the recognition system, whereas our proposed system trains the recognition algorithm using low-resolution face images captured in an unconstrained environment. The proposed system has two components: one performs frontal gait recognition and the other performs low-resolution face recognition. Score-level fusion is then performed to combine the results of the frontal gait recognition and the low-resolution face recognition.
Experiments conducted on the Face and Ocular Challenge Series (FOCS) dataset resulted in a 93.5% Rank-1 recognition rate for frontal gait recognition and an 82.92% Rank-1 recognition rate for low-resolution face recognition. The score-level multimodal fusion resulted in 95.9% Rank-1 recognition, which demonstrates the superiority and robustness of the proposed approach.


Introduction
Identifying and monitoring the activity of registered offenders using video surveillance footage has been proven effective on several occasions, e.g., identifying the Boston bombing suspects, where it led the detectives in the right direction. However, the quality of the video data acquired by surveillance systems poses challenges. The primary causes of poor image quality in most digital video surveillance recordings are low resolution, excessive quantization, and low frame rate. Moreover, high-resolution video surveillance systems require excessive storage space. These factors result in low-resolution biometric data, e.g., face images, obtained from the video clips collected using existing video surveillance systems.
In this paper, we propose a solution for accurate human identification from low-resolution video surveillance footage by combining gait recognition and low-resolution (LR) face recognition. The proposed system, shown in Figure 1, is a fully automatic platform. Due to the unavailability of proper datasets for multimodal face and gait recognition, the studies proposed in the literature were evaluated only on databases with a small set of subjects [1,2]. Moreover, the majority of the approaches in the literature use the lateral gait view, rely on camera calibration, or even require multiple cameras capturing multiple gait views to perform gait recognition. Gait cycle detection is critical for gait feature extraction and can be performed efficiently from the lateral gait view. The majority of the studies on gait recognition [3,4] approximate the gait cycle using various heuristics from the biomechanics literature, which introduce significant estimation error when applied to subjects with a wide range of walking speeds in large databases. In a practical situation, a system that estimates the gait parameters from a single view, without depending on the subject's pose or on camera calibration, is more realistic. We propose an efficient gait recognition technique based on robust gait cycle detection using frontal gait video clips. Previous gait recognition studies in the literature apply either model-free or model-based approaches for feature extraction. We incorporate both model-free features (i.e., the average walking speed of a subject, determined from the average number of frames per gait cycle detected in the video) and model-based features (scale- and translation-invariant 3D moments for shape description) for robust identification.
A considerable amount of literature has been published on low-resolution face recognition. The majority of these studies use high-resolution (HR) images/video to synthetically generate the corresponding low-resolution counterparts; a mapping function is then obtained between each high- and low-resolution image pair. In this paper, we use only the low-resolution face information obtained from the video surveillance data to train, and later test, the proposed low-resolution face recognition algorithm. It is evident from the experimental results that the proposed framework allows the system to learn the high-resolution mapping function from the low-resolution images, resulting in considerably higher classification accuracy by maximizing the signal-to-noise ratio. To the best of our knowledge, the proposed approach is the first fully automatic multimodal recognition framework using LR face images and frontal gait silhouettes from surveillance video clips. Compared to other studies, the performance is evaluated on a relatively large dataset.

Related Work
Even though research in face recognition has been active for the past few decades [5][6][7][8][9], the topic of low-resolution face recognition [10] has only recently received much attention, driven by long-distance surveillance applications that must recognize faces from small or poor-quality images with varying pose, illumination, and expression. Although state-of-the-art face recognition accuracy on data collected in constrained environments is satisfactory, recognition performance in real-world applications such as video surveillance remains an open research problem, primarily due to low-resolution (LR) images [11] and variations in pose, lighting conditions, and facial expressions.
Gait recognition [12,13] is a well-proven biometric modality that can be used to identify a person remotely by inspecting their walking pattern. However, gait recognition has some subtle shortcomings: it can be affected by dressing attire, carrying large objects, etc. Moreover, physical conditions, such as injuries, can also affect a person's walking pattern. The majority of the proposed gait recognition techniques [14,15] employ multi-view gait recognition to overcome the viewing-angle transformation problem and to improve the recognition accuracy.

Low-Resolution Face Recognition
The literature on low-resolution face recognition can be categorized into three broad classes: (1) Mapping into a unified feature space: the HR gallery images and LR probe images are projected into a common space [16]. However, it is not straightforward to find the optimal inter-resolution (IR) space, and computing two bidirectional transformations from both HR and LR to a unified feature space usually introduces noise. (2) Super-resolution: many researchers use up-scaling or interpolation techniques, such as cubic interpolation, on the LR images. Conventional up-scaling techniques usually perform poorly on images of relatively low resolution; however, super-resolution [17,18] methods can be utilized to estimate HR versions of the LR images to perform efficient matching. (3) Down-scaling: down-sampling techniques [11] can be applied to the HR images, followed by comparison with the LR image. However, these techniques perform poorly on the LR problem, primarily because down-sampling removes the high-frequency information that is crucial for recognition.
Due to its challenges and importance for real-world applications, low-resolution (LR) face recognition has gradually become an active research area of biometrics in recent years. Ren et al. [16] proposed a novel feature extraction method for face recognition from LR images, i.e., coupled kernel embedding, where a unified kernel matrix is constructed by concatenating two individual kernel matrices obtained, respectively, from HR and LR images. Sumit et al. [19] proposed an approach that builds multiple dictionaries at different resolutions; after identifying the resolution of the probe image, a reconstruction-error-based classification is performed. A very low resolution (VLR) face recognition technique is proposed in [11] for resolutions lower than 16 × 16, modeling the relationship between HR and VLR images with a piecewise linear regression technique. A super-resolution-based LR face recognition technique is proposed by Yu et al. [20], where an LR face image is split into different regions based on facial features and the HR representation of each section is learned separately. Jia et al. [21] proposed a unified global and local tensor space representation to obtain the mapping functions that acquire the HR information from the LR images and perform efficient LR face recognition.

Gait Recognition
The first step of gait recognition is background subtraction. The feature extraction techniques in the literature [22] can be categorized broadly into two classes: (1) Model-free approaches: in the model-free gait representation [23], the features are composed of a static component, i.e., the size and shape of a person, and a dynamic component, which portrays the actual movement. Examples of static features are height, stride length, and the silhouette bounding box, whereas dynamic features can include frequency-domain parameters such as the frequency and phase of the movements. (2) Model-based approaches: in the model-based gait representations [13,24], a series of static or dynamic gait parameters is obtained by modeling or tracking the entire body or individual parts such as limbs, legs, and arms. Gait signatures formed from these model parameters are used to identify an individual.
Model-free approaches are usually insensitive to the segmentation quality and less computationally expensive than model-based approaches. However, model-based approaches are usually view-invariant and scale-independent compared with their model-free counterparts.
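As an illustration of the static component of a model-free representation, the following sketch extracts bounding-box measurements from a binary silhouette (a minimal example with illustrative names, not the exact feature set used in the cited works):

```python
import numpy as np

def static_gait_features(silhouette):
    """Simple model-free static features from a binary silhouette:
    bounding-box height, width, and their ratio (illustrative choices)."""
    ys, xs = np.nonzero(silhouette)
    height = ys.max() - ys.min() + 1
    width = xs.max() - xs.min() + 1
    return height, width, height / width

# Toy silhouette: a 5x3 block of foreground pixels inside a 10x10 frame.
sil = np.zeros((10, 10), dtype=np.uint8)
sil[2:7, 4:7] = 1
h, w, ratio = static_gait_features(sil)
```

Dynamic features (e.g., the frequency and phase of limb movement) would be computed from how such measurements vary over a sequence of frames.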
To obtain the gait signature from a sequence of gait silhouettes, Davis et al. [25] proposed the motion-energy image (MEI) and motion-history image (MHI), which transform the temporal sequence of silhouettes into a 2D template for gait identification. Later, Han and Bhanu [13] adopted the idea of the motion-energy image (MEI) and proposed the gait energy image (GEI) for individual recognition from gait images. Frequency analysis of spatio-temporal gait signals has been used by researchers to model the periodic gait cycles. Lee et al. [23] proposed a model-free approach that first divides the gait silhouette into seven regions and aligns them with ellipses, and then applies the Fourier transform to the fitted ellipses to extract the magnitude and phase components for classification. Goffredo et al. [26] proposed a k-nearest neighbor (k-NN) classifier for front-view gait recognition in which the gait signature is composed of shape features extracted from sequential silhouettes.

Multimodal Face and Gait Recognition
The fusion of the face and gait modalities has recently received significant attention [27], mainly motivated by its impact on security-related applications. The fusion of the two modalities has been used in the literature to obtain more robust and accurate identification, and can be performed at the feature/sensor level, the decision level, or the matching score level.
In [28], features from high-resolution profile face images and features from gait energy images are extracted separately and combined at the feature level; the fused feature vector is then normalized and used for multimodal recognition. The experimental results on a database of video sequences of 46 individuals demonstrate that the integrated face and gait features perform better than the individual modalities. Shakhnarovich et al. [1] proposed a view-normalized multimodal face and gait recognition algorithm and evaluated it on a dataset of 26 subjects: the face and gait features are first extracted from multiple views and transformed to the canonical frontal face pose and the profile gait view, and the individual face and gait recognition results are then combined at the score level. In [2], a score-level fusion of face and gait images from a single camera view is proposed and tested on an outdoor gait and face dataset of 30 subjects; the results of a view-invariant gait recognition algorithm and a face recognition algorithm based on sequential importance sampling are fused in a hierarchical and holistic fashion. Geng et al. [29] proposed a context-aware multi-biometric fusion of gait and face which dynamically adapts the fusion rules to the real-time context and responds to changes in the environment.

Materials and Methods
To perform multimodal biometric recognition, we need to detect the face and the gait silhouette in the surveillance video clips. The surveillance video clips are captured by a static video camera which records the frontal view of a walking person; the subjects start walking from a distance, directly approaching the camera. We extract both the frontal gait silhouettes from the sequence of video frames and the low-resolution frontal face images, as explained below. We adopted the fast object segmentation method proposed by Papazoglou et al. [30] for segmenting the foreground silhouette from the background. This method is fast, fully automatic, and makes minimal assumptions about the motion of the foreground. Therefore, it performs efficiently in unconstrained settings, in the presence of rapidly moving objects, arbitrary object motion and appearance change, and non-rigid deformations and articulations. The technique first produces a rough estimate of the pixels inside the object, based on motion boundaries computed from the optical flow of pairs of subsequent frames [31,32]. In the second step, a spatio-temporal extension of the GrabCut [33,34] technique is used to bootstrap an appearance model from the initial foreground estimate and refine it by integrating information over the entire video sequence. An example of silhouettes segmented from different frames using fast object segmentation [30] is shown in Figure 2, which demonstrates accurate segmentation that isolates the silhouette from its reflection on the shiny floor. To automatically detect the low-resolution frontal face images in the surveillance video clips, we adopt the Adaboost object detection technique proposed by Viola and Jones [35]. The algorithm is trained to detect low-resolution frontal faces using manually cropped frontal face images from the color FERET [36] database.
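The segmentation stage can be illustrated with a much simpler stand-in than the method of [30]: for a static camera, a per-pixel median background model with thresholded differencing already yields binary silhouettes. This sketch omits the motion-boundary and GrabCut refinement steps entirely:

```python
import numpy as np

def segment_silhouettes(frames, thresh=30):
    """Toy static-camera foreground segmentation: median background model
    plus thresholded absolute difference. Only a stand-in for the fast
    object segmentation of [30]; threshold value is an arbitrary choice."""
    frames = np.asarray(frames, dtype=np.float64)   # (T, H, W) grayscale
    background = np.median(frames, axis=0)          # per-pixel background
    return (np.abs(frames - background) > thresh).astype(np.uint8)

# Toy sequence: a bright 2x2 blob moving across a dark background.
T, H, W = 5, 8, 8
frames = np.zeros((T, H, W))
for t in range(T):
    frames[t, 3:5, t:t + 2] = 255
masks = segment_silhouettes(frames)
```

Because the blob keeps moving, the median at each pixel is the dark background, so the moving region is recovered frame by frame.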
Using the trained detector, we detect the LR frontal faces across the entire video sequence; an example of the detection results from a surveillance video clip is shown in Figure 3. Existing studies in the literature [37,38] suggest that human periodic movement speeds and patterns are similar across repeated trials of the same subject. We incorporate both model-free and model-based feature representations of the segmented silhouettes to obtain accurate and efficient gait recognition. Gait cycle identification from the frontal gait video is proposed to compute the average movement speed for efficient model-free gait recognition. Moreover, gait energy image (GEI) [13] features are also extracted to perform view-invariant and scale-independent model-based gait recognition.
In the following subsections, we describe the proposed methods for gait cycle identification, used to compute the average movement speed, and for extracting 3D moments from the spatio-temporal GEI shape features, using the segmented silhouettes.

Gait Feature Representation
In this section, we first define the gait cycle and then describe the proposed approach to identify gait cycle using only frontal gait information.
The gait cycle [37] can be defined as the time interval between two successive occurrences of the repetitive phases of walking. The gait cycle involves two principal stages: the stance phase and the swing phase. The stance phase occupies 60% of the gait cycle, while the swing phase occupies only 40%, as illustrated in Figure 4. The stance phase consists of Initial Contact, Loading Response, Midstance, Terminal Stance, and Pre-swing, whereas the swing phase is composed of Initial Swing, Mid Swing, and Terminal Swing. The stance phase begins with the heel strike: the moment when the heel begins to touch the ground but the toes do not yet touch. As Figure 4 shows, during the stance phase, in the Midstance position, the difference between the lowest points (or pixel locations) of the two limbs is maximized. Similarly, in the Midswing position of the swing phase, between the Initial Swing and the Heel Strike, the distance between the lowest points of the two limbs is maximized, whereas during the Terminal Swing through Loading Response stages, the distance between the lowest white pixels of the two limbs is minimized. Following this specific attribute of the gait cycle, we can analyze the gait cycle from the frontal silhouette. In Figure 5, we can see that in the silhouette bounding boxes of frames 138 and 152 the difference between the lowest white pixels of the two limbs is maximal, which indicates the successive events of the Midstance through Midswing phases. Moreover, in the silhouettes of frames 144 and 158, the difference between the lowest white pixels of the two limbs is minimal, which signifies the successive events of Pre-swing through Terminal Swing. Therefore, we can identify the entire gait cycle from the sequence of frontal gait silhouettes, starting from Initial Contact (frame 135) through Terminal Swing (frame 158) in Figure 5.
Identifying the gait cycles from the gait video is usually the initial step in gait analysis, separating the periodic occurrences of the walking sequence. The majority of the techniques [13,39] in the literature detect the gait cycle using the profile gait view or multiple gait views, due to the ease of discriminating different gait phases, as described earlier. According to biological studies [40,41] of the cyclic phases of human gait, the body pose changes periodically during walking and the upper and lower limbs move symmetrically. Since the width and height of the bounding box of the binary silhouette depend directly on the limbs' fluctuation, we represent the gait fluctuation as a periodic function of the silhouette's width and height over time. In a frontal gait video, as the subject moves towards the surveillance camera, the silhouette's height and width increase in the later frames compared with the earlier ones. To compensate for these scale variations, we normalize [42] the width and height of the silhouette bounding box. Based on the theoretical premise of the gait cycle and experimental observations on the frontal gait video clips, we propose a gait cycle identifier, GC_identifier(f_t), defined in Equation (1), which represents the periodic motion and cyclic phases. Here, f_t denotes the t-th frame, and H_norm(f_t) and W_norm(f_t) are the silhouette's bounding-box height and width for the t-th frame after normalization to compensate for the scale variations. The second term in Equation (1) is the difference between the lowest white pixels of the two limbs, normalized by H(f_t), the silhouette height in the t-th frame. The multiplier 0.5 normalizes the value of the gait cycle identifier variable. The plot of GC_identifier(f_t) against the sequence of frames is shown in Figure 6.
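The limb-distance cue behind the identifier can be sketched as follows. This implements only the second term of Equation (1), the normalized difference between the lowest white pixels of the two limbs; splitting the bounding box into two limbs at its horizontal midpoint is an assumption made for illustration:

```python
import numpy as np

def limb_gap(silhouette):
    """Normalized vertical gap between the lowest white pixel of each limb,
    sketching the second term of the gait cycle identifier (Equation (1))."""
    ys, xs = np.nonzero(silhouette)
    mid = (xs.min() + xs.max()) / 2.0      # split box into two limbs (assumed)
    left_low = ys[xs <= mid].max()         # lowest white pixel, left limb
    right_low = ys[xs > mid].max()         # lowest white pixel, right limb
    height = ys.max() - ys.min() + 1       # H(f_t): silhouette height
    return abs(left_low - right_low) / height

# Midstance-like frame: one limb extends lower than the other.
stance = np.zeros((10, 6), dtype=np.uint8)
stance[0:9, 1] = 1     # left limb reaches row 8
stance[0:5, 4] = 1     # right limb reaches row 4
# Feet-together frame: both limbs end at the same row.
together = np.zeros((10, 6), dtype=np.uint8)
together[0:9, 1] = 1
together[0:9, 4] = 1
g_stance = limb_gap(stance)
g_together = limb_gap(together)
```

Tracking this value across frames produces the periodic signal whose maxima and minima delimit the Midstance/Midswing and feet-together events described above.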

Three Dimensional Moments from the Spatio-Temporal Gait Energy Image
After the silhouettes are segmented from each of the video frames, their heights are first normalized with respect to the frame height. The average silhouette image, or gait energy image (GEI) [13], represents the principal shape of the human silhouette and its change over a sequence of frames in a gait cycle. A pixel with a higher intensity value in the GEI indicates that the human body was present more frequently at that position. Equation (2) gives the pixel values of the GEI:

G(x, y) = (1/F) Σ_{t=1}^{F} B_t(x, y),   (2)

where t is the temporal frame number from which the silhouette is obtained, F is the total number of frames in a complete gait cycle, and B_t(x, y) is the binary silhouette. The spatio-temporal GEI, or periodic gait volume V(x, y, n), is obtained from the GEIs computed over the gait cycles of a gait video clip, where n represents the gait cycle number. Even though the GEI suffers from some loss of detail, it has numerous benefits compared with representing the binary silhouettes as a temporal sequence. Since the GEI is the average of a sequence of silhouettes, it is not very sensitive to segmentation errors in individual frames. The robustness of the GEI is further improved by discarding pixels with energy values lower than a predefined threshold.
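The GEI of Equation (2), with the energy threshold mentioned above, can be computed in a few lines (the threshold value and function name are illustrative):

```python
import numpy as np

def gait_energy_image(silhouettes, energy_thresh=0.0):
    """Gait energy image per Equation (2): pixel-wise mean of the F binary
    silhouettes of one gait cycle. Pixels with energy below `energy_thresh`
    are zeroed for robustness (threshold value is an illustrative choice)."""
    stack = np.asarray(silhouettes, dtype=np.float64)  # (F, H, W), values {0,1}
    gei = stack.mean(axis=0)
    gei[gei < energy_thresh] = 0.0
    return gei

# Two toy silhouettes: torso pixels overlap, a leg pixel appears once.
s1 = np.zeros((4, 4)); s1[1:3, 1] = 1
s2 = np.zeros((4, 4)); s2[1:3, 1] = 1; s2[3, 2] = 1
gei = gait_energy_image([s1, s2])
```

Pixels covered in every frame get energy 1.0, pixels covered in only some frames get intermediate values, which is exactly the averaging that makes the GEI tolerant of per-frame segmentation errors.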
Shape analysis is a complex problem due to the presence of noise, and in certain cases minor variations between shapes can result in significant changes in the measured feature values. To recognize objects from their shape, features such as eccentricity, moments, Euler number, compactness, and convexity are widely used in the literature [43]. Moments and central moments are used as quantitative measures for shape description [44]. Hu et al. [44] derived a set of moment invariants for various geometric shapes. Moments are widely used in complex shape-based object recognition [45] because they are invariant to orientation.
The three-dimensional raw moments of the spatio-temporal GEI, or periodic gait volume, for each gait cycle can be represented as:

m_{p1 p2 p3} = Σ_x Σ_y Σ_n x^{p1} y^{p2} n^{p3} V(x, y, n),

where the order O of the 3D moments is O = p1 + p2 + p3. For a translation (a, b, c) of the 3D coordinates of the center of mass of the object, the translated three-dimensional moments 3DMoment_{p1 p2 p3} can be represented as:

3DMoment_{p1 p2 p3} = Σ_x Σ_y Σ_n (x - a)^{p1} (y - b)^{p2} (n - c)^{p3} V(x, y, n).

When the center of mass (x̄, ȳ, n̄) is at the origin, the raw moments and the central moments are the same. Thus, the central moment μ_{p1 p2 p3} is obtained by replacing a, b, and c with the mean values of x, y, and n, respectively:

μ_{p1 p2 p3} = Σ_x Σ_y Σ_n (x - x̄)^{p1} (y - ȳ)^{p2} (n - n̄)^{p3} V(x, y, n).

Here, x̄ = m_100/m_000, ȳ = m_010/m_000, and n̄ = m_001/m_000, where m_000 is the zeroth spatial moment and m_100, m_010, and m_001 are the x, y, and n components of the first spatial moment, respectively. A pixel of the periodic gait volume V(x, y, n), e.g., [x_j(n), y_j(n)], is the j-th point belonging to the n-th gait cycle. Hence, the 3D central moment of the spatio-temporal GEI, or periodic gait volume, can be represented as:

μ^{GEIvol}_{p1 p2 p3} = Σ_n Σ_{j=1}^{P(n)} (x_j(n) - x̄)^{p1} (y_j(n) - ȳ)^{p2} (n - n̄)^{p3},

where P(n) is the total number of pixels of the periodic gait volume in gait cycle n. Following the methods described in the previous two sections, we obtain the scale- and translation-invariant three-dimensional moments of the periodic gait volume (μ^{GEIvol}_{p1 p2 p3}). Additionally, the average number of frames in the gait cycles identified from the frontal walking video clip is used as the average movement speed of the subject. These two components together form the gait signature used for classifying the subjects through gait recognition.
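The raw and central 3D moments above translate directly into NumPy; the toy check below confirms the translation invariance of the central moments (array indices stand in for the x, y, n coordinates):

```python
import numpy as np

def raw_moment_3d(V, p1, p2, p3):
    """Raw moment m_{p1 p2 p3} = sum_x sum_y sum_n x^p1 y^p2 n^p3 V(x,y,n)."""
    x, y, n = np.indices(V.shape)
    return np.sum((x ** p1) * (y ** p2) * (n ** p3) * V)

def central_moment_3d(V, p1, p2, p3):
    """Central moment mu_{p1 p2 p3} about the center of mass
    (xbar, ybar, nbar) = (m100, m010, m001) / m000: translation-invariant."""
    m000 = raw_moment_3d(V, 0, 0, 0)
    xb = raw_moment_3d(V, 1, 0, 0) / m000
    yb = raw_moment_3d(V, 0, 1, 0) / m000
    nb = raw_moment_3d(V, 0, 0, 1) / m000
    x, y, n = np.indices(V.shape)
    return np.sum(((x - xb) ** p1) * ((y - yb) ** p2) * ((n - nb) ** p3) * V)

# Translation-invariance check on a toy periodic gait volume.
V = np.zeros((8, 8, 4)); V[2:4, 2:4, 1:3] = 1.0
V_shift = np.zeros((8, 8, 4)); V_shift[4:6, 3:5, 1:3] = 1.0  # shifted copy
```

By construction, the first-order central moments vanish and higher-order central moments are unchanged when the volume is translated.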

Low-Resolution Face Feature Representation
In this section, we describe the proposed algorithm for low-resolution face recognition from surveillance video clips. The components used in the algorithm are detailed in the subsequent subsections.

Super-Resolution
Super-resolution (SR) [17,18] is a class of image processing algorithms used to enhance the resolution of low-resolution images. SR algorithms can enhance resolution starting from a single LR image or from multiple LR images. Interpolation techniques, such as nearest-neighbor, bilinear, and cubic convolution, are widely used in the literature for SR processing of LR images.
The two key components of a digital imaging system are the sensor and the lens, which introduce two types of image degradation: optical blur and a limit on the highest spatial frequency that can be recorded. The sensor is constructed from a finite number of discrete pixels, which results in so-called aliased components in the sensor output. These correspond to spatial-frequency components in the scene that are higher than the sensor can represent and would not normally be present in the output; they are the key information exploited by SR algorithms to obtain the HR representation. The available SR algorithms can be categorized broadly into two major classes: reconstruction-based SR and recognition-based SR. Reconstruction-based methods are suitable for synthesizing local texture, resulting in better visualization, and do not incorporate any specific prior information. In contrast, recognition-based SR [17,18] algorithms try to detect or identify certain pre-configured patterns in the low-resolution data.
The recognition-based SR algorithms [17] learn a mapping between low- and high-resolution image patches from the training LR and HR images, which can then be applied directly to a test LR image to construct its HR counterpart. In the training phase, densely overlapping patches are cropped from each low-/high-resolution image pair, and two dictionaries are trained jointly for the low- and high-resolution patches by enforcing the similarity of the sparse representation of each pair. Given the trained LR and HR dictionaries and a test LR image, the algorithm obtains the HR representation in three steps. First, densely overlapping patches are cropped from the LR input image and pre-processed (i.e., normalized). Second, the sparse coefficients obtained from the LR dictionary for the LR test image patches are passed to the high-resolution dictionary to reconstruct the high-resolution patches. Finally, the overlapping HR reconstructed patches are aggregated (i.e., by weighted averaging) to produce the final output.
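The patch-cropping and overlap-averaging steps of this pipeline can be sketched as follows; the dictionary learning and sparse-coding middle step is omitted, so the sanity check simply verifies that aggregation with an identity mapping reproduces the input:

```python
import numpy as np

def extract_patches(img, size, step):
    """Densely crop overlapping patches (step < size gives overlap)."""
    H, W = img.shape
    coords, patches = [], []
    for i in range(0, H - size + 1, step):
        for j in range(0, W - size + 1, step):
            coords.append((i, j))
            patches.append(img[i:i + size, j:j + size].copy())
    return coords, patches

def aggregate_patches(coords, patches, shape, size):
    """Recombine (possibly HR-reconstructed) patches by averaging overlaps,
    as in the final aggregation step of recognition-based SR."""
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    for (i, j), p in zip(coords, patches):
        acc[i:i + size, j:j + size] += p
        cnt[i:i + size, j:j + size] += 1
    return acc / np.maximum(cnt, 1)

# Identity "mapping": extraction followed by aggregation is lossless.
img = np.arange(36, dtype=np.float64).reshape(6, 6)
coords, patches = extract_patches(img, size=3, step=1)
recon = aggregate_patches(coords, patches, img.shape, size=3)
```

In the real pipeline, each LR patch would be replaced by its HR reconstruction from the dictionaries between these two steps.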
The convolutional neural network (CNN) [46] was developed several decades ago, and deep ConvNets [47] have recently become popular among researchers, primarily due to their success in image classification. A CNN is a specific artificial neural network topology, inspired by the biological visual cortex, formed by stacking multiple stages of feature extractors. CNNs have also been used successfully for other computer vision applications, such as object detection, face recognition, and pedestrian detection.
Dong et al. [18] proposed a CNN-based SR algorithm which directly learns an end-to-end mapping between low- and high-resolution image pairs. The three components of the recognition-based SR pipeline are represented as different layers of a CNN, which efficiently optimizes the entire SR implementation. The mapping is represented as a deep convolutional neural network that takes the low-resolution image as input and outputs the high-resolution one. The first step is patch extraction and representation. The recognition-based SR algorithms [17] use densely extracted patches and represent them by a set of pre-trained bases such as PCA, DCT, and Haar. This is equivalent to convolving the image with a set of filters, each of which is a basis. Thus, the first layer of the CNN can be expressed as:

F_1(Y) = max(0, W_1 * Y + B_1),   (8)

where Y is the input image, W_1 and B_1 represent the filter weights and biases, respectively, and '*' denotes the convolution operation. W_1 is of size c × f_1 × f_1 × n_1, corresponding to n_1 filters of spatial size f_1 × f_1, where c is the number of channels in the image; it applies n_1 convolutions to the image, and the output is composed of n_1 feature maps. B_1 is an n_1-dimensional bias vector, each element of which is associated with one filter. The second component of the recognition-based SR pipeline can be represented by the non-linear mapping step of the CNN. As shown in Equation (8), the first layer extracts an n_1-dimensional feature vector for each patch. In the second operation, each of these n_1-dimensional vectors is mapped into an n_2-dimensional vector. The operation of the second layer can be represented as:

F_2(Y) = max(0, W_2 * F_1(Y) + B_2),   (9)

where W_2 is of size n_1 × f_2 × f_2 × n_2, corresponding to n_2 filters of spatial size n_1 × f_2 × f_2, and B_2 is an n_2-dimensional bias vector. Each output n_2-dimensional vector is a representation of a high-resolution patch that will be used for SR reconstruction.
Finally, the reconstruction step of the recognition-based SR pipeline produces the final HR image by averaging the overlapping high-resolution patches. The averaging can be considered a pre-defined filter on a set of feature maps, in which each position is the flattened vector form of a high-resolution patch:

F(Y) = W_3 * F_2(Y) + B_3,   (10)

where W_3 is of size n_2 × f_3 × f_3 × c, corresponding to c filters of spatial size n_2 × f_3 × f_3, and B_3 is a c-dimensional bias vector. The values of the parameters n_1, n_2, n_3, f_1, f_2, and f_3 used in the experiments are detailed in the experimental results, Section 4.4. The super-resolution pre-processing technique is used to obtain a high-resolution representation of the low-resolution face images, as shown in Figure 7. We can see that the performance of the CNN-based super-resolution recovery method is better than that of the sparse-representation-based super-resolution technique.
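A tiny, untrained sketch of the three-layer mapping of Equations (8)-(10) in NumPy (loop-based 'valid' convolution, random placeholder weights, and toy filter sizes; not the trained configuration of [18]):

```python
import numpy as np

def conv2d(x, W, b):
    """'Valid' multi-channel convolution: x is (C, H, W_img), W is
    (n_out, C, f, f), b is (n_out,). Returns (n_out, H-f+1, W_img-f+1)."""
    n_out, C, f, _ = W.shape
    _, H, Wd = x.shape
    out = np.zeros((n_out, H - f + 1, Wd - f + 1))
    for k in range(n_out):
        for i in range(H - f + 1):
            for j in range(Wd - f + 1):
                out[k, i, j] = np.sum(W[k] * x[:, i:i + f, j:j + f]) + b[k]
    return out

def srcnn_forward(y, W1, B1, W2, B2, W3, B3):
    """Patch extraction, non-linear mapping, reconstruction (Eqs. (8)-(10))."""
    h1 = np.maximum(conv2d(y, W1, B1), 0)   # Eq. (8): max(0, W1 * Y + B1)
    h2 = np.maximum(conv2d(h1, W2, B2), 0)  # Eq. (9): max(0, W2 * F1 + B2)
    return conv2d(h2, W3, B3)               # Eq. (10): W3 * F2 + B3

rng = np.random.default_rng(0)
c, n1, n2, f1, f2, f3 = 1, 4, 3, 3, 1, 3    # tiny illustrative sizes
y = rng.random((c, 12, 12))                 # interpolated LR input image
out = srcnn_forward(
    y,
    rng.standard_normal((n1, c, f1, f1)) * 0.1, np.zeros(n1),
    rng.standard_normal((n2, n1, f2, f2)) * 0.1, np.zeros(n2),
    rng.standard_normal((c, n2, f3, f3)) * 0.1, np.zeros(c),
)
```

With 'valid' convolutions, each layer shrinks the spatial size by f - 1, so a 12 × 12 input yields an 8 × 8 output here; the real network pads or crops accordingly and learns the weights by minimizing reconstruction error against the HR image.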

Illumination and Pose Invariance
In this section, we explain the preprocessing steps for normalizing the low-resolution images with respect to illumination and pose variations.
It has been shown in the literature that illumination variation is among the primary problems in biometric authentication. We adopted the self-quotient image (SQI) [48] to normalize the illumination variations in the low-resolution facial images. SQI incorporates an edge-preserving filtering technique to minimize the spectral variations present in the illumination.
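The quotient operation at the core of SQI can be sketched with a plain box filter in place of SQI's edge-preserving kernel, so this is only an approximation of the method of [48]:

```python
import numpy as np

def box_smooth(img, k=3):
    """Plain box smoothing as a stand-in for SQI's anisotropic,
    edge-preserving kernel (edge preservation is not implemented here)."""
    pad = k // 2
    padded = np.pad(img, pad, mode='edge')
    out = np.zeros_like(img, dtype=np.float64)
    H, W = img.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def self_quotient(img, k=3, eps=1e-6):
    """Q = I / (P * I): pixel-wise division by the smoothed image."""
    return img / (box_smooth(img, k) + eps)

# A smooth illumination ramp divides out to (approximately) a constant.
ramp = np.outer(np.ones(8), np.linspace(1.0, 2.0, 8))
q = self_quotient(ramp)
```

Away from image borders, the quotient of a smoothly varying illumination field is close to 1, which is the sense in which the extrinsic lighting factor is removed.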
The Lambertian model can be factorized into an intrinsic part and an extrinsic part: I(x, y) = ρ(x, y) n(x, y)^T · s = F(x, y) · s, where ρ is the albedo and n is the surface normal. F = ρ n^T depends on the albedo and surface normal of the object and is hence the intrinsic factor, representing the identity of a face, whereas the illumination s is an extrinsic factor. Separating the two factors and removing the extrinsic component is key to achieving robust face recognition under varying illumination. The SQI image Q of an image I can be represented as:

Q = I / Î = I / (P * I),   (11)

where Î is the smoothed version of I, P is the smoothing kernel, '*' denotes convolution, and the division is pixel-wise. SQI [48] removes the extrinsic component s in Equation (11) through a two-step process. First, an illumination estimation step: the extrinsic factor is estimated to generate a synthesized smooth image, which has the same illumination and shape as the input but a different albedo. Second, an illumination effect subtraction step: the illumination is normalized by computing the difference between the logarithms of the albedo maps of the input and the synthesized images (log ρ_0 - log ρ_1). Pose variations present a major problem in real-world face recognition applications. Since the human face is approximately symmetric, if it is in the frontal pose with no rotation, the matrix containing the face image (F) will have the lowest rank. Employing this principle, Zhang et al. [49] proposed transform invariant low-rank textures (TILT) to normalize the pose of a rotated frontal face and remove minor occlusions.
TILT [49] tries to find a transformation (Euclidean, affine, or projective) τ, modeling the corruption with an error matrix E, such that F ∘ τ = F⁰ + E, where F represents the deformed and corrupted face and F⁰ is the corrected low-rank face image, by optimizing the following equation:

min_{F⁰, E, τ} rank(F⁰) + γ ||E||_0   s.t.  F ∘ τ = F⁰ + E,

where ||E||_0 is the l0-norm of the error matrix, i.e., the number of non-zero elements. The optimization thus finds the corrected low-rank face image F⁰ with the lowest possible rank and the error with the fewest non-zero elements that satisfy the constraint, while γ trades off the rank of the matrix against the sparsity of the error. Optimizing the rank function and the l0-norm in the above equation is very challenging; therefore, they are substituted by their convex surrogates. Since the rank of a matrix equals the number of its non-zero singular values, rank(F⁰) can be replaced by the nuclear norm ||F⁰||_*, the sum of its singular values. Likewise, the l0-norm is replaced by the l1-norm, the sum of the absolute values of the matrix elements. Additionally, the constraint F ∘ τ = F⁰ + E is non-linear; linearizing it around the current estimate through an iterative process yields the optimization problem:

min_{F⁰, E, Δτ} ||F⁰||_* + γ ||E||_1   s.t.  F ∘ τ + ∇F Δτ = F⁰ + E,

where ∇F represents the Jacobian of F with respect to the transformation parameters.

Finally, we train a binary classifier using local features (Local Binary Patterns) to remove the false-positive frames detected by the Adaboost face detector.
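The surrogate relationships used above — rank as the count of non-zero singular values, and the nuclear norm as their sum — can be checked numerically. This is a generic illustration of the relaxation, not part of the TILT solver itself.

```python
import numpy as np

# Build a rank-2 matrix from two outer products.
rng = np.random.default_rng(0)
u1, v1 = rng.normal(size=(5, 1)), rng.normal(size=(1, 6))
u2, v2 = rng.normal(size=(5, 1)), rng.normal(size=(1, 6))
F = u1 @ v1 + u2 @ v2

s = np.linalg.svd(F, compute_uv=False)       # singular values

rank = np.linalg.matrix_rank(F)
nonzero_singular = int(np.sum(s > 1e-10))    # rank = # non-zero singular values
nuclear_norm = s.sum()                       # convex surrogate of rank(F)
l1_norm = np.abs(F).sum()                    # convex surrogate of the l0-norm

assert rank == nonzero_singular == 2
```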

Registration and Synthesizing Low-Resolution Face Images
In this section, we describe the image registration of the pre-processed and normalized face regions, and their synthesis using the Curvelet and inverse Curvelet transforms. We adopted the subspace-based holistic registration (SHR) method [50], which was proposed to perform registration on low-resolution face images. The majority of automatic landmark-based registration methods can only perform accurate registration on high-resolution images. However, SHR obtains a user-independent face model using a Procrustes transformation, incorporating the image edges as feature vectors to register low-resolution face images. The best registration parameters are obtained iteratively with the downhill simplex optimization technique by maximizing the similarity score between the probe and the gallery image. The registration similarity is calculated as the probability that the probe and gallery face images are correctly aligned in a face subspace, by computing the residual error in the dimensions perpendicular to the face subspace.
The first step in obtaining the subject-independent face model for registration is to compute the edges in the low-resolution facial image. Gaussian-kernel derivatives of the LR face image are calculated in the x and y directions using G_x and G_y as follows:

G_x(x, y) = (−x / (2πσ^4)) exp(−(x^2 + y^2) / (2σ^2)),
G_y(x, y) = (−y / (2πσ^4)) exp(−(x^2 + y^2) / (2σ^2)).

The derivatives H_x and H_y of the image are obtained by convolving the LR face image with G_x and G_y, resulting in the "edge images" used for registration. A Procrustes transformation is used to align the probe image to the gallery image by correcting variations of scale by a factor f, rotation by an angle α, and translation by u, while preserving distance ratios. Given a pixel location p = (x, y)^T, the transformation U_θ of a pixel location can be represented as:

U_θ p = f R(α) p + u,

where θ = {u, α, f} are the registration parameters and R(α) is the rotation matrix. The transformation of the entire probe image for registration is obtained by applying U_θ to the computed "edge images" as follows:

T_θ H(p) = H(U_θ^{-1} p),

where H = √(H_x^2 + H_y^2). Thus, a registered and aligned image, T_θ H(p), is obtained through backward mapping and interpolation using the optimal registration parameters θ found with the simplex optimization technique.
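The edge images H_x and H_y can be formed by convolving the face image with the Gaussian derivative kernels above. A minimal sketch follows, where the kernel support, σ, and the toy step-edge image are illustration choices, not values from the paper:

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_derivative_kernels(sigma=1.5, radius=4):
    """Return G_x, G_y: x- and y-derivatives of a 2D Gaussian (std sigma)."""
    ax = np.arange(-radius, radius + 1)
    x, y = np.meshgrid(ax, ax)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    gx = -x / sigma**2 * g   # = -x/(2*pi*sigma^4) * exp(-(x^2+y^2)/(2*sigma^2))
    gy = -y / sigma**2 * g
    return gx, gy

def edge_image(face):
    """Gradient-magnitude 'edge image' H = sqrt(H_x^2 + H_y^2)."""
    gx, gy = gaussian_derivative_kernels()
    hx = convolve2d(face, gx, mode="same", boundary="symm")
    hy = convolve2d(face, gy, mode="same", boundary="symm")
    return np.sqrt(hx**2 + hy**2)

# A vertical step edge yields a strong response along the boundary
# and essentially none in the flat regions.
face = np.zeros((24, 24))
face[:, 12:] = 1.0
H = edge_image(face)
```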
To enhance the spectral features for face recognition, image synthesizing methods [51] are very popular in the literature. These methods can be broadly categorized into two classes: those that perform the synthesis in the spatial domain and those that operate in the frequency domain. In this paper, we adopted the Curvelet-based image synthesis [52], which uses the Curvelet coefficients [53] to represent the face.
The Curvelet transform has improved directional capability and a better ability to represent edges and other singularities along curves compared to traditional multiscale transforms, e.g., the wavelet transform. First, Curvelet transforms are applied to the sequence of registered face images. The smallest low-frequency components are represented by the coarse Curvelet coefficients and the largest high-frequency components by the fine Curvelet coefficients. For the image sequence I_1, I_2, . . . , I_n, the Curvelet coefficients can be represented as C_{I_i}{j}{l}, where i = 1, 2, . . . , n indexes the images to be synthesized, and j and l are the scale and direction parameters, respectively. The components of the first scale (j = 1) represent the low-frequency parts of the face images, and the components of the other scales (j > 1) represent the high-frequency parts. The minimum components across C_{I_i}{1}{l} (scale j = 1, i = 1, 2, . . . , n) and the maximum components across C_{I_i}{j}{l} (j = 2, . . . , 5, i = 1, 2, . . . , n) are retained as the synthesized Curvelet coefficients. The inverse Curvelet transform of the synthesized Curvelet feature vector generates the synthesized image used for feature extraction.
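The coefficient selection rule above (elementwise minimum at the coarse scale, elementwise maximum at the finer scales) is a simple per-element operation once the coefficients are available. The sketch below applies it to a hypothetical coefficient structure C[i][j][l] stored as NumPy arrays; the actual Curvelet transform [53] and its inverse are assumed to be provided by an external library.

```python
import numpy as np

def synthesize_coefficients(coeffs):
    """Fuse Curvelet coefficients of n registered images.

    coeffs[i][j][l] is the coefficient array of image i at scale j
    (0-based, so j == 0 is the coarse, low-frequency scale) and
    direction l.  Coarse scale: keep the elementwise minimum across
    images; finer scales: keep the elementwise maximum.
    """
    n_scales = len(coeffs[0])
    fused = []
    for j in range(n_scales):
        scale = []
        for l in range(len(coeffs[0][j])):
            stack = np.stack([c[j][l] for c in coeffs])
            scale.append(stack.min(axis=0) if j == 0 else stack.max(axis=0))
        fused.append(scale)
    return fused

# Two toy "images": one coarse scale + one fine scale, one direction each.
c1 = [[np.array([[1.0, 5.0]])], [np.array([[2.0, 0.0]])]]
c2 = [[np.array([[3.0, 2.0]])], [np.array([[1.0, 4.0]])]]
fused = synthesize_coefficients([c1, c2])
# Coarse scale keeps the minimum; fine scale keeps the maximum.
```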

Feature Extraction
We obtain LBP and Gabor features from the fused image and compare their performance for recognition. In the subsequent sections, we describe the LBP and Gabor feature extraction techniques.
The original LBP operator, introduced by Ojala et al. [54], is a powerful method for texture description. The operator labels the pixels of an image by thresholding the 3 × 3-neighborhood of each pixel with the center value and considering the result as a binary number. Then, the histogram of the labels can be used as a texture descriptor. See Figure 8 for an illustration of the basic LBP operator.
Later, the operator was extended to neighborhoods of different sizes. Using circular neighborhoods and bilinearly interpolating the pixel values allows any radius and any number of sampling points. We use the notation (P, R) to denote P sampling points on a circle of radius R. Figure 9 shows an example of the circular (8, 2) neighborhood. Another extension to the original operator uses so-called uniform patterns. A Local Binary Pattern is called uniform if it contains at most two bitwise transitions from 0 to 1 or vice versa when the binary string is considered circular. For example, 00000000, 00011110, and 10000011 are uniform patterns. Ojala et al. [54] observed in their experiments with texture images that uniform patterns account for slightly less than 90% of all patterns with the (8, 1) neighborhood and around 70% with the (16, 2) neighborhood.
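The uniformity test described above — at most two 0/1 transitions when the binary string is read circularly — can be written directly:

```python
def is_uniform(pattern: str) -> bool:
    """True if the circular binary pattern has at most two 0<->1 transitions."""
    transitions = sum(
        pattern[i] != pattern[(i + 1) % len(pattern)]
        for i in range(len(pattern))
    )
    return transitions <= 2

# The example patterns from the text are all uniform:
assert all(is_uniform(p) for p in ("00000000", "00011110", "10000011"))
# A checkerboard pattern, with eight transitions, is not:
assert not is_uniform("01010101")
```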
An extension of the LBP-based face description method was proposed by Ahonen et al. [55]. The facial image is divided into local regions (k × k windows) and LBP texture descriptors are extracted from each region independently. The descriptors are then concatenated to form a global description of the face in a high-dimensional feature space. Window sizes of k = 3, 5, 7 are used in our experiments.

2D Gabor filters [56] are used in a broad range of applications [57] to extract scale- and rotation-invariant feature vectors. In our feature extraction step, uniformly down-sampled Gabor wavelets are computed for the detected regions using Equation (18), as proposed in [58]:

ψ_{μ,ν}(z) = (||k_{μ,ν}||^2 / s^2) exp(−||k_{μ,ν}||^2 ||z||^2 / (2s^2)) [exp(i k_{μ,ν} · z) − exp(−s^2/2)],

where z = (x, y) represents each pixel in the 2D image, k_{μ,ν} is the wave vector, defined as k_{μ,ν} = k_ν e^{iφ_μ} with k_ν = k_max / f^ν, k_max is the maximum frequency, f is the spacing factor between kernels in the frequency domain, φ_μ = πμ/8, and the value of s determines the ratio of the Gaussian window width to the wavelength. Using Equation (18), Gabor kernels are generated from one filter using different scaling and rotation factors. In this paper, we used five scales, ν ∈ {0, ..., 4}, and eight orientations, μ ∈ {0, ..., 7}. The other parameter values used are s = 2π, k_max = π/2, and f = √2. Gabor features are computed by convolving each Gabor wavelet with the synthesized super-resolution pre-processed LR face image, as follows:

C_{μ,ν}(z) = T(z) * ψ_{μ,ν}(z),

where T(z) is the face image and z = (x, y) represents the pixel location. The feature vector is constructed from C_{μ,ν} by concatenating its rows.
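A sketch of generating the Gabor kernel bank with the stated parameters (five scales, eight orientations, s = 2π, k_max = π/2, f = √2). The 31 × 31 kernel support is an arbitrary illustration choice, and the per-orientation phase πμ/8 follows the common Gabor-face formulation:

```python
import numpy as np

def gabor_kernel(mu, nu, size=31, s=2 * np.pi,
                 k_max=np.pi / 2, f=np.sqrt(2)):
    """Gabor wavelet psi_{mu,nu} sampled on a size x size grid."""
    # Wave vector k_{mu,nu} = k_nu * e^{i*phi_mu}, k_nu = k_max / f^nu.
    k = (k_max / f**nu) * np.exp(1j * np.pi * mu / 8)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    z2 = x**2 + y**2
    kz = k.real * x + k.imag * y               # dot product k_{mu,nu} . z
    return (np.abs(k)**2 / s**2
            * np.exp(-np.abs(k)**2 * z2 / (2 * s**2))
            * (np.exp(1j * kz) - np.exp(-s**2 / 2)))

# Five scales x eight orientations = 40 kernels in the bank.
bank = [gabor_kernel(mu, nu) for nu in range(5) for mu in range(8)]
```

Gabor features would then be obtained by convolving each kernel in the bank with the face image and concatenating the responses.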

Experimental Results
In this section, we first describe the Face and Ocular Challenge Series (FOCS) [59] dataset. Then, we present the experiments and results of the frontal gait recognition and the low-resolution face recognition, followed by the score level fusion that produces the multimodal recognition. When evaluating the performance of the biometric recognition algorithm, we query each test data instance against all subjects present in the gallery, which yields classification probabilities over all subjects. When reporting Rank-1 accuracy, we count an instance as a true positive only if the subject with the highest classification probability exactly matches the test subject. Rank-1 accuracy is therefore a strict measure that penalizes every incorrect classification, whereas rank-k for k > 1 tolerates some error.

FOCS Dataset
The video challenge dataset, Face and Ocular Challenge Series (FOCS) [59], contains video sequences of individuals acquired on different days. Students from The University of Texas at Dallas, between the ages of 18 and 25, volunteered for the data collection. The FOCS dataset was collected in two sessions; in the second, duplicate session, the subjects have a different hairstyle and different clothing, and may otherwise differ in appearance.
The FOCS database contains a variety of still images and videos of a large number of individuals taken in a variety of contexts. For our experiments, we used the frontal walking video sequences. The videos in the FOCS database were collected with a Canon Optura Pi digital video camera, which uses a single progressive-scan CCD digitizer, resulting in minimal motion-aliasing artifacts. The videos were stored in DV Stream format at 720 × 480 pixels with 24-bit color and 29.97 frames per second. In the frontal walking video sequences, the subject walks parallel to the line of sight of the camera, approaching it, and veers off to the left upon reaching the camera. These sequences capture the subject from the start point until he/she goes out of view; the duration therefore varies with walking speed, but is approximately 10 seconds on average. The FOCS frontal walking video sequences were acquired from 136 unique subjects, with a varying number of samples per subject. Of the 136 subjects, 123 have at least 2 videos. We used the data from these 123 subjects for our experiments, where one of the video clips is randomly chosen for training and the other is used for testing.

Experimental Setup
To perform the multimodal recognition, we first segment the frontal gait silhouette from the background and detect the low-resolution face images from the frontal walking video sequences as described in Section 3. In order to evaluate the proposed algorithm, we perform the frontal gait recognition and the low-resolution face recognition experiments as two separate components. Later, we use the match score level fusion scheme to fuse the individual recognition results.

Frontal Gait Recognition
Once the binary gait silhouette is acquired, we obtain the scale- and translation-invariant 3D moments of the spatio-temporal GEI (the periodic gait volume), together with the average number of frames in the identified gait cycles of the frontal walking video clips, to prepare a high-dimensional feature vector as described in Section 3.1.1. The frontal gait features are classified with a k-nearest neighbor (k-NN) classifier, where a test gait feature vector is assigned to the class that minimizes the distance between the gallery and the probe gait feature vectors.
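For k = 1, the matching described above reduces to assigning the probe to the gallery subject with the smallest feature distance. A minimal NumPy sketch with made-up feature vectors (the gallery values and labels are purely illustrative):

```python
import numpy as np

def nearest_neighbor(gallery, labels, probe):
    """Assign the probe to the label of the closest gallery feature (1-NN)."""
    dists = np.linalg.norm(gallery - probe, axis=1)   # Euclidean distances
    return labels[int(np.argmin(dists))]

# Toy gallery: one gait feature vector per enrolled subject.
gallery = np.array([[0.1, 0.9, 0.2],
                    [0.8, 0.1, 0.7],
                    [0.4, 0.4, 0.4]])
labels = ["subject_A", "subject_B", "subject_C"]

probe = np.array([0.75, 0.15, 0.65])   # closest to subject_B's template
assert nearest_neighbor(gallery, labels, probe) == "subject_B"
```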
We performed quantitative experiments using different moment orders O = p_1 · p_2 · p_3. The best recognition performance was obtained with p_1 = p_2 = p_3 = 10. The results of frontal gait recognition are presented in Table 1. The recognition performance using the 3D moments of the periodic gait volume is better than that of the average-movement-speed feature representation; concatenating the two feature vectors improves the gait recognition performance further. Table 2 compares the frontal gait recognition performance of the proposed approach with the accuracy of state-of-the-art techniques on the FOCS dataset. The proposed frontal gait recognition system achieves a Rank-1 recognition rate of 93.5% on the 123 subjects of the FOCS dataset.

Low-Resolution Face Recognition
The first step of LR face recognition is to detect the low-resolution faces in the video surveillance frames using the Adaboost detector, as described in Section 3. The proposed LR face recognition Algorithm 1 is described in Section 3.2. After employing the CNN-based super-resolution technique to obtain the high-resolution equivalent of the LR faces, we perform the illumination and pose normalization steps. The sizes of the pre-processed face images vary between 40 × 40 and 180 × 180 pixels. To effectively leverage the high-frequency information present in the pre-processed face images, we separate them into two classes: face images smaller than 96 × 96 pixels are labeled Class-1, and those larger than 96 × 96 pixels are labeled Class-2. A face image of size 72 × 72 pixels in Class-1 serves as the base or template image for the SHR registration technique [50], described in Section 3.2.3, which registers all Class-1 face images after rescaling them to 72 × 72 pixels. Similarly, a face image of size 120 × 120 pixels in Class-2 is used as the base or template image to register all Class-2 face images after rescaling them to 120 × 120 pixels. After synthesizing the Class-1 and Class-2 face images separately using the Curvelet coefficients, as described in Section 3.2.3, we obtain two synthesized face images for each surveillance video clip. We extract the LBP and Gabor feature vectors, as described in Section 3.2.4, from the two synthesized face images and concatenate them to obtain the composed LBP and Gabor features representing the LR face in the surveillance video clip. The LBP and Gabor feature vectors are used separately to compare their performance in LR face recognition.
For each of the 123 subjects used in the performance evaluation, the feature vector obtained from one randomly chosen surveillance video clip is used to build the model and the one obtained from the other video is used for testing. We compare the performance of the proposed LR face recognition technique using the CNN-based super-resolution [18] with the following baseline algorithms. It is worth noting that all the comparisons are based on the same training/test set.
(1) LR face recognition without any super-resolution pre-processing technique.
The obtained LR face features were classified using a k-nearest neighbor (k-NN) classifier, where a test LR facial feature vector is assigned to the class that minimizes the distance between the gallery and the probe feature vectors. The results of low-resolution face recognition are presented in Table 3. The performance using the local feature representation (LBP) is better than that using the global feature representation (Gabor features). Moreover, employing the CNN-based super-resolution technique increases the LR face recognition performance to 82.91%, compared with 72.36% without any SR pre-processing of the LR face images.

Multimodal Recognition Accuracy
Score level fusion techniques are very popular in multimodal biometrics applications, specifically for fusing face and gait [2,27]. In our experiments, the results of the different classifiers were combined directly using the Sum, Max, and Product rules.
To prepare for fusion, the matching scores obtained from the different matchers are transformed into a common domain using a score normalization technique, after which the score fusion methods are applied. We adopted the tanh score normalization technique [60], which is both robust and efficient, defined as follows:

s_j^n = (1/2) { tanh( 0.01 (s_j − μ_GH) / σ_GH ) + 1 },

where s_j and s_j^n are the match scores before and after normalization, respectively, and μ_GH and σ_GH are the mean and standard deviation estimates of the score distribution given by the Hampel estimators [61]. The Hampel estimators are based on the influence function ψ, an odd function defined for any x (the matching score s_j in this paper) as follows:

ψ(x) = x,                              0 ≤ |x| < a,
ψ(x) = a sgn(x),                       a ≤ |x| ≤ b,
ψ(x) = a ((r − |x|) / (r − b)) sgn(x), b ≤ |x| ≤ r,
ψ(x) = 0,                              r ≤ |x|,

where sgn(x) = +1 if x ≥ 0 and −1 otherwise. In Equation (21), the values of a, b, and r in ψ reduce the influence of the scores at the tails of the distribution during the estimation of the location and scale parameters, i.e., μ_GH and σ_GH in Equation (23).

The normalized match scores of the synthesized face images of the gallery and probe, and the normalized match scores of the gaits of the gallery and probe from the same video clips, are fused using different match score fusion techniques. Let s_jF^n and s_jG^n be the normalized match scores obtained from a specific video clip for the face and gait, respectively. The unknown test subject is classified to class C if the fused match score corresponding to class C is maximal over all classes in the gallery:

FR{s_CF^n, s_CG^n} = max_j FR{s_jF^n, s_jG^n},  j ∈ (1, 2, . . . , N),

where FR{·, ·} represents the fusion rule and N is the number of enrolled individuals in the gallery. In this paper, we use the Sum, Max, and Product rules. The results of the fused multimodal recognition are presented in Table 4.
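A minimal sketch of the normalization and Sum-rule step. Plain mean/std estimates stand in here for the robust Hampel-based μ_GH and σ_GH, and the match scores are made up for illustration:

```python
import numpy as np

def tanh_normalize(scores, mu, sigma):
    """Tanh score normalization: maps raw match scores into (0, 1)."""
    scores = np.asarray(scores, dtype=float)
    return 0.5 * (np.tanh(0.01 * (scores - mu) / sigma) + 1.0)

# Toy match scores of one probe against N = 3 gallery subjects.
face_scores = np.array([0.70, 0.20, 0.40])
gait_scores = np.array([0.60, 0.30, 0.10])

# Plain estimates in place of the Hampel-based mu_GH, sigma_GH.
nf = tanh_normalize(face_scores, face_scores.mean(), face_scores.std())
ng = tanh_normalize(gait_scores, gait_scores.mean(), gait_scores.std())

fused = nf + ng                      # Sum rule
predicted = int(np.argmax(fused))    # class with the maximal fused score
```

Replacing `nf + ng` with `np.maximum(nf, ng)` or `nf * ng` gives the Max and Product rules, respectively.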
We can see that the fusion based on the Sum rule of the frontal gait and the LR face results in the best recognition accuracy.

Parameter Selection for the CNN Super Resolution
We tested the performance of the proposed LR face recognition with different parameters of the convolutional neural network. The number of layers in the CNN was varied between 3 and 5, and the best performance was obtained with the 3-layer architecture. The recognition-based super-resolution algorithm has three distinct steps, which explains the optimal performance of a CNN with 3 layers. Experiments were conducted by varying the numbers of filters n_1 and n_2 (refer to Equations (9) and (10)) of the CNN architecture. Three sets of network parameters were evaluated: (n_1 = 32, n_2 = 16), (n_1 = 64, n_2 = 32), and (n_1 = 128, n_2 = 64). The best performance was achieved with n_1 = 128 and n_2 = 64. Although the super-resolution restoration speed decreases as the number of filters grows, this setting offers a reasonable trade-off, so we set n_1 and n_2 to 128 and 64, respectively. Moreover, the sizes of the filters f_1, f_2, and f_3 (refer to Equations (8)-(10)) were varied among (9, 1, 5), (9, 3, 5), and (9, 5, 5). The best accuracy-performance trade-off was obtained with f_1 = 9, f_2 = 3, and f_3 = 5. With the above-mentioned parameter settings, 8 × 10^8 backpropagation iterations were needed to achieve convergence.
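The size of the chosen 3-layer network can be sanity-checked by counting its weights. The sketch below assumes a single-channel input and an SRCNN-style layout with the selected f_1 = 9, f_2 = 3, f_3 = 5 and n_1 = 128, n_2 = 64 (bias terms included); the single-channel assumption is ours, not stated in the text.

```python
def conv_params(f, c_in, c_out):
    """Weights + biases of one convolutional layer with f x f kernels."""
    return f * f * c_in * c_out + c_out

# 3-layer SRCNN-style network: 1 -> n1 -> n2 -> 1 channels.
f1, f2, f3 = 9, 3, 5
n1, n2 = 128, 64

total = (conv_params(f1, 1, n1)       # patch extraction:   9x9x1x128
         + conv_params(f2, n1, n2)    # non-linear mapping: 3x3x128x64
         + conv_params(f3, n2, 1))    # reconstruction:     5x5x64x1
# total == 10496 + 73792 + 1601 == 85889 learnable parameters
```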

Conclusions
We proposed a system for highly accurate multimodal human identification from low-resolution video surveillance footage through LR face and frontal gait recognition using a single biometric data source, i.e., frontal walking surveillance video. Using the trained Adaboost detector, we automatically detect the LR face images. The frontal gait binary silhouettes are segmented using the fast object segmentation algorithm. We proposed an approach for accurately identifying the gait cycles in the entire gait video clip using only frontal gait information, from which we extract the average movement speed and the shape features. The detected LR face images are pre-processed using super-resolution techniques to obtain a high-resolution representation, followed by illumination and pose normalization, and image synthesis through registration. Finally, Gabor and LBP features are extracted from the synthesized face images. The nearest neighbor classifier is used to obtain modality-specific Rank-1 recognition for each modality, and the individual recognition results are fused through score level fusion. The results indicate that combining the LR face and frontal gait modalities produces the best Rank-1 recognition accuracy compared with the performance of each individual modality.