Card3DFace—An Application to Enhance 3D Visual Validation in ID Cards and Travel Documents

Abstract: The identification of a person is a natural way to gain access to information or places. A face image is an essential element of visual validation. In this paper, we present the Card3DFace application, which captures a single-shot image of a person’s face. After reconstructing the 3D model of the head, the application generates several images from different perspectives, which, when printed on a card with a layer of lenticular lenses, produce a 3D visualization effect of the face. The image acquisition is achieved with a regular consumer 3D camera, using plenoptic, stereo or time-of-flight technologies. This procedure aims to assist and improve the human visual recognition of ID cards and travel documents through an affordable and fast process while simultaneously increasing their security level. The whole system pipeline is analyzed and detailed in this paper. The results of the experiments performed with polycarbonate ID cards show that this end-to-end system is able to produce cards with realistic 3D visualization effects for humans.


Introduction
In recent times, security issues have become fairly prominent in our daily routines, not only digitally but also visually. The necessity to authenticate an individual in order to grant access to a restricted area or task is now common practice. Identity verification, which confirms that a given identity is real and that the individual claiming it is entitled to it, has become of major importance. Generally, identity verification is required when there is a risk associated with dealing with the wrong person. The level of confidence in identity claims depends on the risk related to incorrect identification and on the distribution of liability among the involved parties.
The usage of a face image in authentication cards and travel documents is considered the simplest and the most common method, thus rendering it the first step in the forging of a document. Consequently, the counterfeiting techniques for forging documents have also increased and improved, especially regarding visual authentication, which represents an easier way to forge documents. The use of techniques to make this task more difficult or even impossible has attracted significant attention.
Regarding facial recognition, recent technological advances, boosted by deep learning architectures, have already demonstrated their ability to solve the problem of recognizing a person from a single photo. Systems based on architectures such as ArcFace [1] or CosFace [2], for instance, have recently presented an extremely high confidence level in predictions, even outperforming a human operator. On the other hand, the Face Recognition Vendor Tests (FRVT) 1:1 and 1:n challenges, continuously promoted by the National Institute of Standards and Technology (NIST), reveal a wide variety of commercial and non-commercial solutions for the facial recognition problem. The research topics in this area are now becoming particularly focused on the fairness of the population distribution in terms of gender, skin color, ethnicity and other characteristics of the subjects [3,4].
Furthermore, in the context of using facial recognition for security purposes, new approaches can either improve the security of the portrait photo [5] by providing an embedding of the face in the document itself or can take advantage of the three-dimensional information of faces. Particularly, in [6], it is shown that 3D facial recognition can achieve more accurate and reliable recognition results by exploring the inherently 3D shapes of faces.
Understanding the expressions of humans is also important for the issuing of ID and travel documents. In fact, document photos must be free of facial expressions to be compliant with the standards [7,8], which helps in the recognition task. In [9], an overview on this topic is presented, particularly focusing on 3D faces.
As a result, many systems currently lean towards facial recognition with liveness detection, which is designed for automatic decision systems such as airport gates or at the entrance of buildings, for instance, although there is still a strong need for physical documents such as ID cards and travel documents (sovereign issued documents) or civil ID cards for commercial and business purposes. The counterfeiting of such documents places considerable pressure on issuing entities to improve security and prevent presentation attacks.
Consequently, portrait photos in ID and travel documents have been used in different formats and secured with different elements and technologies. The main photo of these documents is usually plain, sometimes secured with hidden elements as in the IPI™ solution [10] or the Lasink™ solution [11], to name only a few of the commercial solutions adopted by the industry. Additionally, the portrait photo of the citizen is often stored in the document's chip or in a secondary photo printed with any digital transformation or even in a lenticular structure in polycarbonate cards, as in the CLI/MLI™ solution [12].
Particularly in this last element, the CLI/MLI (Changeable or Multiple Laser Image) element, the portrait photo is personalized in the polycarbonate card in a lenticular structure that creates a two-layer image effect, usually with the portrait photo in one layer and an alphanumeric string or code in the other layer that is visible from a different viewing angle.
In this paper, we are particularly interested in the elements printed in lenticular structures and in the three-dimensional visualization effect that appears when the photo is personalized on the card surface, under the lenticular structure, through the use of photos generated from a 3D model of the person's head. This set of generated images (head views) is intended to be printed on cards with a vertical lenticular structure, providing a 3D visualization effect and thereby allowing a more accurate visual face validation. Besides the improvement in visual validation, this 3D effect in the card also makes it more difficult to forge.
To build the 3D model of a face, several techniques can be used, depending on the type of cameras or images available. Although our technique can be applied to a setup with multiple cameras, for instance, our system is focused on 3D cameras, either using light-field, time-of-flight, stereo or structured light technologies, as long as the information can be obtained from a single shot. This topic is important for our systems as, in the context of ID and travel documents, the citizen's photos are usually obtained from portals or single camera setups as it is often not practical to have a multi-camera setup (thus also avoiding synchronization and alignment problems).
There is, however, a drawback concerning occlusions when using a single camera. In fact, some parts of the face are not seen, which limits the reconstruction of the 3D model. In [13], the authors present an overview of occlusion detection and restoration techniques for 3D models of faces in the recognition context.
Our application was thus named Card3DFace, and it is the outcome of an innovation project by the University of Coimbra and the Portuguese Mint and Official Printing Office (Imprensa Nacional-Casa da Moeda, INCM), the manufacturer of ID cards and travel documents in Portugal.
Most of the available systems make use of multiple facial images, sometimes taken simultaneously, as inputs. They also tend to use complex pipelines for model building and fitting [14]. This application aims to be an easy and rapid way to provide a solution for the use of 3D face images in ID cards and travel documents for authentication. Since our solution is based on single image acquisition (which is much more convenient for acquisition portals and sites), it does not require very sophisticated equipment for the acquisition process. In effect, the solution relies on an affordable 3D camera. This solution is designed to be integrated into the process of producing ID cards and travel documents.
In summary, the motivation of our initial study was the research and development of an end-to-end application to produce polycarbonate ID and travel documents with visualization effects using lenticular lenses. In the current article, we present not only the filtering of 3D models, as presented in the authors' previous work [15], but the whole system, from the acquisition of images to the printing of the cards, by analyzing each of the pipeline phases and by discussing the available technology and the options taken. This article thus seeks to be an inspiration for other engineering articles in the industry of security printing and the authentication of persons. This paper has the following structure: in Section 2, we mention work related to the reconstruction of faces and RGB-D characteristics; Section 3 introduces the system characterization, with an explanation of the main steps of the proposed system; Section 4 demonstrates the interface and its functionalities; in Section 5, we present the results obtained with the system; finally, Section 6 sheds light on the main conclusions.

Related Work
As we present an end-to-end system to produce identity cards with the 3D visualization effect in this article, our main contribution is related to the way that pieces are put together to produce the expected visualization effect. This related work section is thus organized according to the most important research topics, while disregarding some of the less relevant engineering aspects.

Three-Dimensional Face Reconstruction
Three-dimensional face reconstruction is the task of reconstructing a human face from an image into a 3D form (or mesh). Over the last few decades, researchers have expended great efforts on this matter due to its challenges and its huge applicability. As a result, substantial progress has been made, which has led to novel and powerful algorithms that obtain impressive results, even in the very challenging case of reconstruction from a single RGB or RGB-D camera. According to Zollhöfer et al. [16], unlike RGB cameras, RGB-D sensors capture both color and depth data at real-time rates. This helps to solve the inherent depth ambiguity of the monocular reconstruction problem, since at least a coarse geometry estimate is available. The use of an RGB-D sensor also enables a more reliable and realistic reconstruction of the face.
The depth sensors used by RGB-D approaches can be classified as passive or active devices. Passive depth sensors are most commonly implemented via a stereo camera setup. If a point is found in both views of a calibrated stereo setup, the 3D point can be reconstructed through triangulation. Active cameras work with a light projector. These structured light cameras are widespread (e.g., Microsoft Kinect, Primesense Carmine, Intel Realsense) and provide relatively good depth in the near range, which is important for face-tracking tasks. Time-of-flight (ToF) cameras are another type of active depth camera. These RGB-D cameras compute depth by measuring the round-trip time of a light pulse (e.g., Creative Senz3D or Microsoft Kinect One).
In the work proposed by Thies et al. [17], one of the main contributions is a new algorithm to reconstruct the high-quality facial performance of each actor in real time from an RGB-D stream, captured in a general environment with large Lambertian surfaces and smoothly varying lighting. The proposed method uses a parametric face model that spans a PCA space of facial identities, face poses and corresponding skin albedo. This model, which is learned from real face scans, works as a statistical prior and an intermediate representation, which finally enables the photo-realistic re-rendering of the entire face.
Hsieh et al. [18] proposed an approach that unites facial tracking, segmentation and tracking model personalization from an RGB-D model. They detect dynamic occlusions caused by temporal shape and texture variations using an outlier voting scheme in superpixel space. The model demonstrates robust and high-fidelity facial tracking on a wide range of subjects with highly incomplete and largely occluded data.
Another interesting work that uses RGB-D as input was proposed by Bouaziz et al. [19]. This work demonstrates that online model building can replace user-specific training and manual calibration for facial performance capture systems while maintaining high tracking accuracy. It simply requires a low-cost 3D sensor and no manual assistance of any kind. The authors introduced an adaptive dynamic expression model that combines a dynamic expression template, an identity PCA model and a parameterized deformation model in a low-dimensional representation, which is suitable for online learning.
One of the main problems with RGB-D cameras is the intrinsic noise that can affect the final result in face reconstruction. Filtering is a solution, but in some cases, it may not provide a satisfactory result. A face has some particular properties; thus, it is important that the filtering preserves them so as not to defeature the reconstruction. In this sense, we presented a filter approach based on exemplars [15].
In recent years, several filtering approaches have been proposed to process mesh geometry, denoising and smoothing meshes, targeting the output from 3D scanners or RGB-D cameras. The image processing literature considers that methods of mesh denoising may be classified as isotropic or anisotropic [20,21]. On the one hand, isotropic methods are independent of surface geometry; that is, they normally remove noise and high-frequency features together. For example, low-pass filters were some of the first models proposed to treat meshes. These filters remove high-frequency noises but also smoothen sharp features [22,23]. Isotropic methods therefore have difficulty in preserving geometric features. On the other hand, anisotropic filters are based on anisotropic geometric diffusions, which are inspired by scale space and anisotropic diffusion in image processing [24]. They are often needed to preserve features such as sharp edges and corners.
In our case, the main objective is to present a robust filtering model that keeps the information coherent and reliable even under adverse circumstances. This correction process is similar to that which is solved in texture-based synthesis techniques [20].
To model the specific deformation of 3D human faces used for novel view synthesis, which is necessary to print a lenticular card, one possible method is presented by [25]. In this method, the reconstructed visual hull from the shape-from-silhouette approach is used to refine the 3D model by iterating for photo-consistency, image contour and surface smoothness.
Furthermore, a very recent work [26] presents a stereo camera based on a new sensor that particularly explores the rotation of an image sensor and the parallax generated by the stereo pair of images. This work is also suitable for use in Card3DFace, as well as conventional consumer stereo cameras, since their small baseline can generate the 3D reconstruction of faces.

Head Model Reconstruction and Filtering
Fields such as computer graphics or geometric modeling are quite advanced regarding representation and mesh processing. Some of the wide variety of modeling techniques that represent data structures of meshes focus on faces (Face Set and Shared Vertices) [27], and others focus on edges (Winged-Edge and Half-Edge) [28].
Face-based representations are considered the simplest kind of representation and are implemented in the most common file formats, such as OFF, OBJ and STL. However, they do not provide any connectivity information for triangles; thus, edge representations present themselves as a more complete strategy for storing a mesh geometry, such as the position of the vertices, the incident faces on an edge and the vertices that make up a face.
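To make the difference between the two families concrete, the following minimal Python sketch starts from a shared-vertex face set (the kind of data stored in OBJ or OFF files) and derives the edge-to-face connectivity that edge-based structures keep explicitly. The tiny two-triangle mesh and the helper name `build_edge_to_faces` are illustrative assumptions, not part of Card3DFace.

```python
from collections import defaultdict

# Shared-vertex ("indexed face set") mesh: two triangles forming a unit square.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2), (0, 2, 3)]


def build_edge_to_faces(faces):
    """Map each undirected edge (i, j) with i < j to the faces incident on it."""
    edge_to_faces = defaultdict(list)
    for face_index, (a, b, c) in enumerate(faces):
        for i, j in ((a, b), (b, c), (c, a)):
            edge_to_faces[(min(i, j), max(i, j))].append(face_index)
    return dict(edge_to_faces)


edge_to_faces = build_edge_to_faces(faces)

# Edges with a single incident face lie on the mesh boundary.
boundary_edges = sorted(e for e, f in edge_to_faces.items() if len(f) == 1)
print(edge_to_faces[(0, 2)], boundary_edges)
```

A face set has to recompute this map whenever neighborhood queries are needed, whereas a half-edge or winged-edge structure stores it once, which is convenient for the neighborhood-based filtering discussed next.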
As for the filtering, it represents a phase where the mesh points that render the mesh irregular and defective are smoothed. Since 3D camera-acquired models are invariably noisy, the process of mesh smoothing is of great importance. There are several techniques in the literature used to obtain results for denoising [21]. Some of these techniques are based on the use of filters, such as the Gaussian filter and the Savitzky-Golay filter, among others. However, these filters are generic and ignore the intrinsic characteristics of a face model, which may lead to distortions such as the flattening of the nose or deformation of the mouth and eyes.
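The flattening effect of a generic filter can be illustrated with a minimal Python sketch. It applies a plain Gaussian filter to a hypothetical one-dimensional depth profile with a triangular "nose" of height 20; the profile, the kernel size and the value of sigma are assumptions chosen only for illustration.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(profile, kernel):
    """Convolve a 1D profile with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    n = len(profile)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * profile[j]
        out.append(acc)
    return out


# Hypothetical depth profile across a face: flat cheeks and a sharp nose tip.
profile = [max(0.0, 20.0 - 4.0 * abs(i - 15)) for i in range(31)]
smoothed = smooth(profile, gaussian_kernel(sigma=3.0, radius=9))

print(max(profile), round(max(smoothed), 2))
```

The filter preserves the total "mass" of the profile but spreads it sideways, so the nose tip loses a large part of its height. This is the kind of distortion that motivates a filter that is aware of facial structure.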

System Building Blocks
Card3DFace is a system based on four main steps that address the acquisition, modeling, generation of face views and printing phases, as illustrated in Figure 1. The first step seeks to obtain an image of a person's face based on a single-shot image acquisition. The second phase, modeling, includes two specific steps: the reconstruction and filtering processes. This phase is responsible for transforming the acquired image into a 3D model. The generation of head views corresponds to the third phase, where perspective images of the generated 3D model are obtained and used for printing in lenticular cards, which corresponds to the fourth step. In this section, we focus on the four steps of the system pipeline separately.

Acquisition
For this application, it is necessary to use cameras that are capable of obtaining not only the visual characteristics but also the geometric information of the scene either directly or indirectly. In this case, the scene corresponds to faces positioned in front of the camera so that information related to depth can be extracted. In the development of this application, three types of cameras were considered and studied, namely plenoptic cameras (also known as light-field cameras), time-of-flight cameras and stereo cameras.

Studied Camera Technologies and Selection
Plenoptic (or light-field) cameras present a different architecture from conventional cameras. This difference lies in the fact that conventional cameras are composed of the main lens and an image sensor, whereas plenoptic cameras have a microlens array between the image sensor and the main lens of the camera. This microlens array allows the light field to be captured from various points of view, forming a 4D light field with a 2D image, so that it is possible to estimate the depth of the scene [29]. This estimate of the depth is obtained by using the redundancy created by multi-view geometry, where a 3D point is projected onto the image several times.
There are two kinds of plenoptic cameras: standard and multifocus. These two kinds were studied using the Lytro Illum camera and the Raytrix R42 camera, respectively. The difference is mainly the focal length of the microlenses. In a standard plenoptic camera [30], the focal length of the microlenses corresponds to the distance between the image sensor and the array of microlenses; thus, all microlenses have the same focal length. Each microlens contributes only one pixel value to the final image, so the resolution of this image is equal to the number of microlenses. This feature drastically reduces the image resolution compared to sensor capacity; however, the computational power required to process the image is also lower, making it suitable for compact cameras. An example of this type of camera is the Lytro Illum in Figure 2 (left). On the other hand, the multifocus plenoptic camera has the microlens array placed in front of the image sensor, where each microlens has a different focal length from its neighboring lenses, which are thus classified as different lens types. This design allows a better combination between effective resolution and depth of field size, resulting in higher-resolution images. An example of a multifocus plenoptic camera is the R42 camera model developed by Raytrix GmbH, presented in Figure 2 (right). Since they are multifocal (there are microlenses with three different focal lengths), these cameras have additional calibration issues: a large number of parameters need to be adjusted, which is a disadvantage for this application.
Two time-of-flight cameras were also studied: the DepthSense325 and the RealSense (Figure 2). These cameras estimate distance from the speed of light by measuring, for each image point, the time a light signal takes to travel from the camera to the object and back. They use an artificial light source (a laser or LED) to estimate this distance for each point in the image. They are also affordable and relatively easy to use. This type of camera, due to the simplicity of its use and the results presented, was selected to be included in this application.
The technology of stereo cameras was also taken into consideration for the defined purpose. A stereo camera is a type of camera that possesses two lenses, with an image sensor for each lens. This camera calculates the disparity between the two images taken at different positions, allowing it to simulate human binocular vision. This provides the ability to capture three-dimensional images, a process known as stereoscopy. The camera employed for the tests was the Stereo ZED, shown in Figure 3. For this camera, the acquired images revealed that the distance between the camera and the object was not adequate for face images: the minimum distance for acquisition was larger than the distance used to capture a face image, hence the reconstruction process that estimates the depth parameter did not produce reliable values.
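The round-trip principle reduces to a one-line relation, depth = c · t / 2, where t is the measured round-trip time and c is the speed of light. The short Python sketch below only illustrates this relation; the function names are hypothetical, and real sensors typically measure t indirectly (for instance through phase shifts) rather than with a stopwatch-like timer.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def tof_depth_m(round_trip_time_s):
    """Depth from the round-trip time of a light signal (out and back)."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


def round_trip_time_s(depth_m):
    """Inverse relation: time needed for light to reach the object and return."""
    return 2.0 * depth_m / SPEED_OF_LIGHT_M_PER_S


# A face placed 0.5 m from the camera corresponds to a round trip of a few nanoseconds.
t = round_trip_time_s(0.5)
print(t, tof_depth_m(t))
```

The nanosecond scale of t at face distances explains why small timing errors translate into visible depth noise, which the filtering stage later has to handle.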
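The minimum-distance limitation follows from the usual rectified-stereo relation Z = f · B / d, where f is the focal length in pixels, B the baseline and d the disparity: the largest disparity the matcher can search bounds the closest measurable depth. The sketch below uses purely hypothetical values for f, B and the maximum disparity; they are not the specifications of the ZED camera.

```python
def stereo_depth_m(focal_px, baseline_m, disparity_px):
    """Depth of a point from its disparity in a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def minimum_depth_m(focal_px, baseline_m, max_disparity_px):
    """Closest measurable depth, reached at the maximum searchable disparity."""
    return stereo_depth_m(focal_px, baseline_m, max_disparity_px)


# Hypothetical setup: 700 px focal length, 12 cm baseline, 128 px disparity range.
z_min = minimum_depth_m(700.0, 0.12, 128.0)
print(z_min)  # any face closer than this cannot be matched reliably
```

With values of this order, the minimum range exceeds a typical portrait distance, which is consistent with the unreliable depth values reported above.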

Set-Up for Image Acquisition Conditions
Light represents one of the major components in the acquisition of an image, and it directly interferes with the obtained results. In order to improve image quality, which is highly important for the reconstruction of the 3D model, it is necessary to consider artificial light in the scene. Tests were conducted with acquisitions made under different light conditions, both artificial and natural. For the artificial light, we used two softboxes with 5500 K lights, which are commonly used for photography lighting.
It is also necessary to include the background factor, which needs to be homogeneous, in accordance with the general recommendations of international standards for ID and travel documents. The use of light and a uniform background can enhance the model reconstruction owing to the fact that it decreases the existence of outliers in the reconstruction process. The tests with variations in lighting and background are presented in Figure 4, where one can easily see the importance of having good lighting and background conditions.

Modeling
To reconstruct the model, it is necessary to obtain a mesh that represents and stores the facial data structure obtained in the acquisition. For this data storage, the volume of data needs to be taken into account as well as the ease of handling it. As the aim of this work was to obtain a mesh that represented the data structure of the acquired face, the use of an RGB-D data structure was considered an advantage. This structure stores not only the color information of each pixel (RGB) but also the depth information (D) in the scene. The RGB-D structure has shown great advances in scene reconstruction in terms of algorithmic concepts and with respect to different application scenarios [16]. Figure 5 represents the whole modeling process of one sample image. The process starts with the acquisition of the RGB image, followed by the depth map estimation, then the 3D reconstruction and the mesh construction.
This model generation is thus divided into two main steps: reconstruction and filtering.

Reconstruction
In the reconstruction from the initial image acquired by a camera, an RGB-D structure is generated. Information is stored regarding the 2D image (color at each point in the point cloud), as well as a point cloud according to the captured 3D scene (spatial position x, y, z). This information is used to construct the model.
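As an illustration of how an RGB-D structure relates to a colored point cloud, the sketch below back-projects a depth map through a pinhole camera model. The intrinsic parameters `fx`, `fy`, `cx`, `cy` and the function name are assumptions made for this example; in practice, the 3D cameras considered here usually deliver the point cloud through their own software.

```python
def depth_to_point_cloud(depth, rgb, fx, fy, cx, cy):
    """Back-project each valid depth pixel to (x, y, z, color) with a pinhole model."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:  # no valid depth measurement at this pixel
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z, rgb[v][u]))
    return points


# Tiny 2 x 2 example: one pixel has no depth and is skipped.
depth = [[1.0, 0.0],
         [2.0, 2.0]]
rgb = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (255, 255, 255)]]
cloud = depth_to_point_cloud(depth, rgb, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(len(cloud), cloud[0])
```

Each entry keeps the spatial position together with the color sampled at the same pixel, which is the pairing the reconstruction relies on.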
The reconstruction of the face is based on the point cloud P obtained from the input device (camera). These points, with spatial positions x, y, z, need to be aligned and organized into a mesh structure that can be easily manipulated and visualized to obtain the head views. The first step of reconstruction is a preprocessing step that eliminates the outliers. These outliers are generated by noise or lighting problems at the time of acquisition. Their removal is based on the distance of the face to the camera and on its depth. In order to obtain the position of the face and verify which points belong to it or are considered outliers, we use the facial landmark detector proposed by [31]. It estimates the location of 68 (x, y) coordinate pairs that map to facial structures on the face. These landmarks can be visualized in Figure 6.
After estimating the landmarks, we calculate a circular region $C_n$ around the face's nose (red circle in Figure 6 (left)). The center of $C_n$ is given by the x and y coordinates of the 31st landmark calculated according to [31], and the radius of $C_n$ corresponds to the distance between the 30th and 31st points.
Outliers are then removed with a depth test in which $d_i$ is the depth of $P_i$ and $\tau$ is the threshold calculated based on the depth of a human face. After eliminating the outliers, the remaining points P are rearranged into a regular mesh. First, we compute a triangular mesh using a Delaunay triangulation. Then, we recreate the depth map by fitting a surface of the form Z = F(X, Y). The inputs X and Y are 2D grid coordinates based on the coordinates contained in the vectors x and y from P. The grid is represented by the coordinates X and Y, with length(y) rows and length(x) columns. X, Y and Z are the new coordinates of a regular mesh that represents the 3D surface of the face. The texture of the model is mapped using UV mapping, which projects the 2D image obtained from the photo onto the 3D surface.
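A depth-based outlier test of this kind can be sketched as follows. The exact criterion is not spelled out in this text, so the sketch assumes one plausible form: the mean depth of the points inside the nose circle serves as a reference, and a point is kept when its depth differs from that reference by less than the threshold. The function name, the choice of the mean and the sample values are all assumptions.

```python
import math


def remove_depth_outliers(points, nose_center, nose_radius, tau):
    """Keep points whose depth is within tau of the mean depth of the nose region.

    points: list of (x, y, z) with x, y in the same coordinates as the landmarks.
    ASSUMPTION: reference depth = mean depth inside the nose circle.
    """
    cx, cy = nose_center
    nose_depths = [z for (x, y, z) in points
                   if math.hypot(x - cx, y - cy) <= nose_radius]
    if not nose_depths:
        raise ValueError("no points fall inside the nose region")
    reference_depth = sum(nose_depths) / len(nose_depths)
    return [(x, y, z) for (x, y, z) in points if abs(z - reference_depth) < tau]


# Hypothetical cloud (depths in mm): three face points and one background point.
points = [(0.0, 0.0, 500.0), (1.0, 0.0, 510.0), (30.0, 0.0, 560.0), (40.0, 0.0, 1500.0)]
face_points = remove_depth_outliers(points, nose_center=(0.0, 0.0), nose_radius=2.0, tau=150.0)
print(face_points)
```

Anchoring the test on the nose region makes the threshold relative to where the face actually is, so the same value of the threshold can be reused when the subject stands slightly closer to or farther from the camera.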

Filtering
The next step in the system pipeline is the filtering, which is necessary to smooth the noisy 3D reconstruction. We thus developed a specific filter for face meshes: a content-aware filter for RGB-D faces that smooths each point of a given mesh by comparing its local neighborhood with a set of exemplars [15]. This filter consists of an exemplar-based neighborhood matching, where all models are in a frontal position, avoiding rotation and perspective. We take advantage of prior knowledge of the models (faces) to improve the comparison. We first detect facial feature points, create the point correctors for the region of each feature and only use the corresponding regions to correct a point of the filtered mesh. As a result, the model is invariant to depth translation and scale. The proposed method is evaluated on a public 3D face dataset with different levels of noise. The results show that the method is able to remove noise without smoothing the sharp features of the face. Figure 7 illustrates the proposed filtering method [15].
The filtering model comprises two main steps: the model standardization and the filtering itself. The goal of the model standardization is to allow different scales and sampling frequencies to be handled, since they result from distinct acquisition processes (notice that here we propose a system that can handle different types of cameras). The model standardization process consists of changing the sampling frequency of a given model, named target $\Pi$, according to a base model, named exemplar $\Omega$. In cases when we use more than one exemplar to define the filter, one of them is chosen to be the base, and the others are also resampled. Firstly, we use a set of facial feature points (FFPs), also known as facial landmarks [31] (see Figure 6, left), to align and resample the faces. After defining the landmarks of the two models to be aligned, an Iterative Closest Point (ICP) [32] algorithm is then computed. This method returns a scale $s \in \mathbb{R}$, a rotation $R$ (a $2 \times 2$ matrix) and a translation $c \in \mathbb{R}^2$ that, when applied to the second model, align it with the first model, thus minimizing the difference between the two point sets. It is an affine transformation that can be represented in homogeneous coordinates by the following matrix:
$$\Gamma = \begin{bmatrix} sR & c \\ 0^{\top} & 1 \end{bmatrix}$$
After the alignment, we can perform the target resampling process. This consists of creating a rectangular grid of target points, named $\Delta = \{(x_{kl}, y_{kl}, z_{kl}, u_{kl}, v_{kl});\ k = 1, \ldots, M,\ l = 1, \ldots, N\}$. It is performed by defining the coordinates XYZ and UV of $M \times N$ points (the dimension of $\Delta$) taken regularly in the target texture space. The resampling is based on a triangulation of the points of $\Pi$ in UV space.
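The action of such an alignment on a landmark can be sketched in a few lines of Python. The sketch builds a 3 × 3 homogeneous matrix from a scale, a rotation angle and a 2D translation and applies it to a point; the function names and the numeric values are illustrative assumptions, and the estimation of the parameters themselves (the ICP step) is not shown.

```python
import math


def similarity_matrix(scale, angle_rad, translation):
    """3x3 homogeneous matrix combining scale, 2D rotation and translation."""
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    tx, ty = translation
    return [
        [scale * cos_a, -scale * sin_a, tx],
        [scale * sin_a, scale * cos_a, ty],
        [0.0, 0.0, 1.0],
    ]


def apply_transform(matrix, point):
    """Apply a homogeneous 3x3 transform to a 2D point."""
    x, y = point
    h = (x, y, 1.0)
    xh, yh, w = (sum(row[k] * h[k] for k in range(3)) for row in matrix)
    return (xh / w, yh / w)


# Hypothetical parameters: scale 2, rotation of 90 degrees, translation (1, 0).
gamma = similarity_matrix(2.0, math.pi / 2.0, (1.0, 0.0))
aligned = apply_transform(gamma, (1.0, 0.0))
print(aligned)  # rotated to (0, 1), scaled to (0, 2), then shifted by (1, 0)
```

Packing the three parameters into one matrix means that every landmark can be moved to the exemplar texture space with a single matrix-vector product.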
The definition of M and N (the dimension of $\Delta$) is based on the target FFPs transformed into the exemplar texture space. Each target FFP (in the target texture space) is multiplied, in homogeneous coordinates, by $\Gamma$. An oriented bounding box is created around these transformed points. Finally, M and N are the dimensions of this box.
Once these dimensions are defined, it is necessary to create the $M \times N$ points of $\Delta$. The resampling starts with texture (samples inside the face bounding box in target texture space), and for each sample, we need to define the respective XYZ and UV coordinates. It is also necessary to define $(j, i)$ for each sample in target texture space and then the respective $(u, v) = \psi^{-1}(j, i)$. The coordinates $(x, y, z)$ are obtained based on this $(u, v)$. We create a Delaunay triangulation of the UV coordinates of all points of $\Pi$ and detect the triangle composed of $p_a$, $p_b$ and $p_c$ that contains $(u, v)$. It is noteworthy that $p_a, p_b, p_c \in \Pi$ and that they have XYZ and UV coordinates. Let $\lambda_a, \lambda_b, \lambda_c \in [0, 1]$ be the respective barycentric coordinates; then, the XYZ coordinates of this point are given by $(x_{j,i}, y_{j,i}, z_{j,i}) = \lambda_a (x_a, y_a, z_a) + \lambda_b (x_b, y_b, z_b) + \lambda_c (x_c, y_c, z_c)$. This completes all coordinates of the $\Delta$ points. After the resampling, we perform a filtering process that consists of modifying the Z value of the target points. This is achieved through a comparison between the neighborhood of that point and the neighborhoods of equivalent points in the exemplar. In this step, both the target and the exemplar are regular grids at the same sampling frequency. This phase is divided into two parts: the Predictor Definition and the Correction Process.
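The barycentric interpolation used in the resampling can be sketched as follows. Given a query position in UV space and the triangle that contains it, the sketch computes the three weights and blends the XYZ coordinates of the triangle vertices. The triangle search over the Delaunay triangulation is omitted, and the function names and sample triangle are assumptions for illustration.

```python
def barycentric_coordinates(p, a, b, c):
    """Barycentric weights of 2D point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        raise ValueError("degenerate triangle")
    lam_a = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    lam_b = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return lam_a, lam_b, 1.0 - lam_a - lam_b


def interpolate_xyz(uv, tri_uv, tri_xyz):
    """Blend the XYZ coordinates of the triangle vertices with the weights of uv."""
    lam = barycentric_coordinates(uv, *tri_uv)
    return tuple(sum(l * vertex[k] for l, vertex in zip(lam, tri_xyz))
                 for k in range(3))


# Hypothetical triangle: UV positions and the XYZ coordinates stored at its vertices.
tri_uv = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
tri_xyz = [(0.0, 0.0, 10.0), (1.0, 0.0, 20.0), (0.0, 1.0, 30.0)]
xyz = interpolate_xyz((0.25, 0.25), tri_uv, tri_xyz)
print(xyz)
```

Because the weights sum to one and are all in [0, 1] for a point inside the triangle, the interpolated position always lies on the triangle itself, so the resampled grid stays on the original surface.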
The filter is a Nearest Neighbor Predictor whose input is a neighborhood of a target point containing k × k Z-values (centered at this point), and the output is the respective normalized Z-value of the central point of the closest neighborhood from exemplars (normalization is explained below). We create one predictor per FFP region, and each one is trained by using all neighborhoods in all exemplars that belong to the respective FFP region.
The definition of the FFP regions is given by a Voronoi Diagram. For each exemplar, a Voronoi Diagram of all FFPs is created (Figure 6, right), and for each region, in turn, all points inside it are used to train the respective predictor.
Once the points per region are defined, it is necessary to normalize each neighborhood by subtracting its average and dividing by its variance. This guarantees that all neighborhoods can be compared, since all of them are at the same scale and depth translation. The next step is the correction process, which consists of modifying the position of the $\Delta$ points according to the predictor. For each point $p \in \Delta$, (i) we determine its respective region (FFP), (ii) we obtain its neighborhood and normalize it (with the respective mean and variance), and finally (iii) we project it onto the PCA basis. We use the normalized and reduced neighborhood in the prediction process. The predictor returns the normalized Z-value of the central point according to the best-matching neighborhood. We then take this value, multiply it by the variance and add the average of the neighborhood of $p \in \Delta$. The normalization of the exemplar and target neighborhoods ensures that we can compare them irrespective of scale (division by the variance) and translations in depth (subtraction of the mean). In addition, it is noteworthy that a neighborhood of the exemplar is normalized with its own mean and variance, but the denormalization is performed by using the mean and variance of the neighborhood of the target that is being corrected. Thus, we transfer the neighborhood feature of the exemplar to the target, with the invariance mentioned above. Figure 7 illustrates this step.
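The normalize, match and denormalize cycle can be sketched with a brute-force nearest-neighbor search over flattened k × k neighborhoods. This is a simplified illustration rather than the filter of [15]: the per-region predictors and the PCA projection are omitted, and the sketch divides by the standard deviation (the text above refers to the variance), an assumption made here because it keeps the comparison independent of scale. All names and sample values are hypothetical.

```python
import math


def normalize(neighborhood):
    """Return the zero-mean, unit-spread neighborhood plus its mean and spread."""
    n = len(neighborhood)
    mean = sum(neighborhood) / n
    spread = math.sqrt(sum((z - mean) ** 2 for z in neighborhood) / n)
    if spread == 0.0:  # perfectly flat patch: avoid division by zero
        spread = 1.0
    return [(z - mean) / spread for z in neighborhood], mean, spread


def correct_center(target_patch, exemplar_patches):
    """Replace the central Z-value using the best-matching exemplar neighborhood."""
    target_norm, mean, spread = normalize(target_patch)
    center = len(target_patch) // 2
    best_center, best_dist = None, float("inf")
    for patch in exemplar_patches:
        exemplar_norm, _, _ = normalize(patch)
        dist = sum((a - b) ** 2 for a, b in zip(target_norm, exemplar_norm))
        if dist < best_dist:
            best_dist, best_center = dist, exemplar_norm[center]
    # Denormalize with the statistics of the TARGET neighborhood.
    return best_center * spread + mean


# Flattened 3 x 3 patches: a clean ramp and an isolated bump as exemplars.
exemplars = [[0, 1, 2, 0, 1, 2, 0, 1, 2], [0, 0, 0, 0, 1, 0, 0, 0, 0]]
# Target: the same ramp at another scale and depth, with a noisy center (113 instead of 110).
target = [100, 110, 120, 100, 113, 120, 100, 110, 120]
corrected = correct_center(target, exemplars)
print(corrected)
```

In this toy case the ramp exemplar is selected even though the target lives at a different depth and scale, and the corrected center moves from 113 back towards the value expected on the ramp.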

Head Views
The process for generating head views consists of rotating the previously created model around a vertical axis between the eyes. Depending on the number of head views to be generated, a value for the angular rotation is defined for both sides of the face. The application presented here allows the selection of either 5 or 7 generated head views, corresponding to angles of 9° and 6.5°, respectively. These values were chosen as a balance between the alignment of the head views and a smooth and realistic transition between them when printed on the lenticular card. It is necessary to avoid a jump in the image when the card is rotated to visualize the 3D effect.
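A minimal sketch of the view generation is given below. It assumes that the quoted angles (9° and 6.5°) are the angular steps between consecutive views, that the views are distributed symmetrically around the frontal one, and that the vertical axis is the y axis of the mesh coordinate system. The function names and the pivot point are hypothetical.

```python
import numpy as np

def view_angles(num_views, step_deg):
    """Symmetric rotation angles (degrees) around the frontal view."""
    half = (num_views - 1) / 2
    return [(i - half) * step_deg for i in range(num_views)]

def rotate_about_vertical_axis(vertices, angle_deg, pivot):
    """Rotate N x 3 vertices about a vertical (y) axis passing through pivot."""
    t = np.radians(angle_deg)
    c, s = np.cos(t), np.sin(t)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    vertices = np.asarray(vertices, dtype=float)
    pivot = np.asarray(pivot, dtype=float)
    return (vertices - pivot) @ rot_y.T + pivot

# Hypothetical mesh vertices and a pivot placed between the eyes.
mesh = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.5]])
pivot = np.array([0.0, 0.0, 0.0])
views = [rotate_about_vertical_axis(mesh, a, pivot) for a in view_angles(5, 9.0)]
```

Under this interpretation, five views span -18° to 18° and seven views span -19.5° to 19.5°.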

Lenticular Printing
The different head views generated are meant to be printed on lenticular cards, thus providing the desired 3D effect. Lenticular printing is a process that has been widely used to produce optical effects such as 3D perception or image flipping. Lenticular technology involves exhibiting several sets of images, which change when viewed from different angles. The 3D effect is created because each eye of the viewer sees the image from a slightly different viewpoint. The images to be viewed are sliced into strips and interlaced with each other (Figure 8). In a specific scenario, the printing process consists of the usage of a sheet of a cylindrical lens array placed on top of a high-resolution LCD in such a way that the LCD image plane is located at the focal plane of the lenses [33]. The printing techniques for lenticular lenses are still considered to be of interest in the reconstruction of a 3D image and are still used in commercial applications, with advantages in terms of their low cost and easy fabrication [34]. In our case, the printed depth in the lenticular lenses was established at 450 µm, as this depth displayed a more realistic effect of the 3D image in experiments.
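The slicing and interlacing of the views can be sketched as follows. This is a generic column-wise interlacing under the assumption that each lenticule covers one pixel column from each view; the actual strip width, view order and laser personalization layout depend on the lens pitch and on the printer, and are not specified here. The function name `interlace_views` is hypothetical.

```python
import numpy as np

def interlace_views(views):
    """Interlace n equally sized views column by column.

    Output column j * n + i holds column j of view i, so that each group
    of n adjacent columns (one lenticule) contains one strip per view.
    """
    stack = np.stack(views)  # shape (n, H, W) or (n, H, W, channels)
    n, h, w = stack.shape[:3]
    out = np.empty((h, w * n) + stack.shape[3:], dtype=stack.dtype)
    for i in range(n):
        out[:, i::n] = stack[i]
    return out

# Five hypothetical 2 x 3 grayscale views, each filled with its own index.
demo_views = [np.full((2, 3), i, dtype=np.uint8) for i in range(5)]
interlaced = interlace_views(demo_views)
```

The interlaced image is n times wider than a single view; in practice, the views would be resampled so that the final width matches the printable area under the lenses.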

Application Interface
The Card3DFace application was developed in the C++ language and Qt environment, which is widely used for developing multi-platform user interfaces that can run on desktops and mobile devices. This application presents four main areas that encompass the controls (1 on Figure 9) and the three principal stages of the model (2, 3 and 4 on Figure 9). It possesses a block for image acquisition, a block for the generated model and a block for rendering the head views for printing.
The controls for the application are gathered in a single ribbon at the top. In this command bar, we can select the tasks to perform; namely, uploading from a previously generated model (Figure 9a), camera acquisition (Figure 9b), filtering (Figure 9c), the selection of the number of head views (Figure 9d), the generation of the head views (Figure 9e) and file storage (Figure 9f).
There is a defined procedure to follow when running this application. Firstly, it is necessary to load a previously generated model (a) or take a picture of a person's face (b). In the next step, it is necessary to apply the filtering technique (c), and after defining the quantity of head views to be generated (d), we can proceed with the generation of head views (e). There is also the possibility of saving the obtained images (f) for later use for printing.

Experiments and Results
The experimental setup for the Card3DFace system was composed of a rig for the portrait acquisition and a polycarbonate personalization machine (printer). The acquisition rig was composed of several 3D cameras (plenoptic, stereo and time-of-flight), a photographic studio with soft-boxes and a homogeneous background and the software application. Regarding the personalization of the cards, this was achieved with a specialized printer with laser technology, which is commonly used for the personalization of Portuguese ID cards at the Portuguese Mint and Official Printing Office (INCM).
After the acquisition and modeling processes, the generated head view images could be used for printing on the lenticular cards. The calibration of the personalization machine was conducted by expert operators. This process involved the calibration of the exact position of the laser beam reaching the lenticular lenses and the exact calibration of the angle of the card holder with respect to the laser beam. This calibration guaranteed that the generated view images were personalized on the card surface below the lenticular lenses. As this calibration process depends on the specific printer and its inner design, its description does not fall within the scope of this article.
We present some examples in Figures 19-23 of the head views generated for printing, and an imprinted card prototype example is presented in Figure 25. Figure 24 shows the same card viewed from different perspectives. Notice that the printed 3D face will commonly have small dimensions and usually occupies a reduced area of the card and travel document.
Before evaluating the whole process, we describe some ablation studies made for the phases of the process.

Reconstruction Evaluation
Although the 3D reconstruction is relatively straightforward for each type of input camera, we review the cameras used and describe in greater detail the steps from the input image to the estimation of the mesh (Figures 10 and 11).
Figure: Resulting depth map after roughness restriction [35] and cost volume refinement [36].

Three-Dimensional Reconstruction Model-Lytro Illum
To estimate depth with Lytro Illum cameras, we followed one of the most popular approaches in current research: the use of cost volumes [35,36]. A cost volume is a volumetric structure in which each layer represents the cost of assigning the pixels of the image to a given depth. The lower the cost, the more likely the pixel is to be at that depth. In a way, one can relate the cost volume C to p, an estimate of the probability of a coordinate {x, y} in the image being at a given depth z. The advantage of using a cost volume instead of a "probability volume" is that depth estimation can be treated as a minimization problem. For a given image, we generated the cost volume using the method of [35], including its post-processing of the volume. Then, we applied our roughness restriction to further refine the results. In this method, given a cost volume C sampled along the depth axis z, the depth l is estimated by selecting the lowest-cost layer for each pixel:
$$l_{x,y} = \arg\min_z C(x, y, z)$$
The roughness constraint is defined as a modification of this depth solution:
$$\hat{l}_{x,y} = \arg\min_z \hat{C}(x, y, z) = \arg\min_z \left[ C(x, y, z) + \lambda R(x, y, z) \right]$$
where λ is a number in the range [0, 1] and R(x, y, z) is the constraint function. The restriction encourages each pixel to be placed near its neighbors. Given a neighborhood kernel N of size n, the roughness restriction R is expressed through G, the incentive function for a given depth t, and G is defined using inverted Gaussian distributions. Finally, the depth of the image was reconstructed from the final finite volume using a common existing method: parabola fitting. The lowest-cost layer for each pixel, $l_{x,y}$, was found, and a second-degree polynomial was then fitted to the costs at layer $l_{x,y}$ and its neighboring layers $l_{x,y}+1$ and $l_{x,y}-1$. Finally, the depth of the pixel, $d_{x,y}$, was set as the minimum of the polynomial.
This allowed solutions to be found at points between layers, resulting in smoother and more accurate predictions.
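The layer selection and the parabola fitting can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the constraint function R is not reproduced here, so the sketch only accepts an optional precomputed array `roughness` (an assumed input with the same shape as the cost volume) that is added as C + λR. Depth is returned in units of layer index, and pixels whose best layer lies on the first or last layer keep their integer index.

```python
import numpy as np

def depth_from_cost_volume(cost, roughness=None, lam=0.0):
    """Sub-layer depth per pixel from a cost volume of shape (H, W, Z)."""
    cost = np.asarray(cost, dtype=float)
    if roughness is not None:
        cost = cost + lam * np.asarray(roughness, dtype=float)  # C + lambda * R
    h, w, z = cost.shape
    best = cost.argmin(axis=2)  # lowest-cost layer per pixel
    depth = best.astype(float)
    if z < 3:
        return depth  # not enough layers to fit a parabola
    interior = (best > 0) & (best < z - 1)
    inner = np.clip(best, 1, z - 2)  # safe index for neighbor lookup
    rows, cols = np.indices((h, w))
    c_m = cost[rows, cols, inner - 1]
    c_0 = cost[rows, cols, inner]
    c_p = cost[rows, cols, inner + 1]
    # Vertex of the parabola through (-1, c_m), (0, c_0), (1, c_p).
    denom = c_m - 2.0 * c_0 + c_p
    ok = interior & (denom > 1e-12)
    depth[ok] += 0.5 * (c_m[ok] - c_p[ok]) / denom[ok]
    return depth

# Synthetic volume whose true minimum lies between layers 2 and 3.
layers = np.arange(6, dtype=float)
demo_cost = np.broadcast_to((layers - 2.3) ** 2, (4, 5, 6))
demo_depth = depth_from_cost_volume(demo_cost)
```

For an exactly quadratic cost, as in this synthetic example, the fitted minimum coincides with the true minimum, which illustrates how solutions between layers are recovered.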
Additionally, some cost volume refinement could be performed by applying the method of [36] in order to improve the depth results. As mentioned by the authors, the refinement in a prior stage (cost volume) is beneficial in terms of final accuracy over the refinement of the depth map.

Three-Dimensional Reconstruction Model-Raytrix
The methods presented by Ferreira et al. [37,38] to estimate the depth map of an image of a multi-focus plenoptic camera can be used with the Raytrix camera. This method takes advantage of the different focal lengths to perform ray tracing in order to obtain depth (back-projecting the pixels to the array of microlenses). The method begins with finding salient points and their matches in neighbor microlenses using a scaled value from the sum of absolute differences. To perform the ray tracing and to obtain more robust results, the method filters the noisy results using a RANSAC approach, eliminating unwanted results.
Raytrix cameras have additional calibration problems as they are multifocal. As such, there are several parameters that need to be adjusted. The main objective during our vision system configuration was to determine the best camera setup for estimating a depth map, including both the photographic environment and calibration parameters.
The results obtained with the method of Ferreira et al. [38] are presented in Figure 12, and a 3D reconstruction example using the software of the manufacturer Raytrix is presented in Figure 13.

Three-Dimensional Reconstruction Model-Time of Flight
The 3D reconstruction model for images obtained by the DepthSense camera, a time-of-flight (ToF) camera, was an alternative 3D reconstruction solution to the models using plenoptic cameras. In addition to presenting itself as an alternative solution, this model allowed us to verify the quality of 3D reconstruction using a low-cost camera. For this camera type, we opted to use the software from the manufacturer. The acquisition of a given scene by the DepthSense camera is described in two files: a file that stores a 2D image of the scene and a file that describes a point cloud of the captured 3D scene. These two files were used in the initial phase (input) of our proposed method: a file that stored the 3D information (x, y, z) for each point in the cloud and a file that stored the color of each point of the point cloud. Figure 14 illustrates the data initially captured by the DepthSense camera. In the next step, outliers that corresponded to noisy points improperly captured by the camera were removed. Based on the remaining points, a regular mesh was created to constitute the mesh of the 3D reconstruction. As previously stated, filtering was the following phase, as described in the next subsection.

Three-Dimensional Reconstruction Model-Stereo
As previously stated, the stereo camera used in our system was a ZED camera. However, since the working distance of this camera was not appropriate for face acquisition in a studio environment, we opted to discard these experiments.
Nonetheless, stereo cameras are generally suitable for the 3D reconstruction of scenes and consequently for the estimation of meshes of the reconstructed scene.

Filtering Evaluation
In order to evaluate the results of the filtering, we performed experiments by applying white noise [39] to a set of different models. We varied the noise intensity and compared the results with Bilateral [40] and Gaussian filters. We then calculated the mean-squared error (MSE) [41] between the noisy models and filtered models. Table 1 shows the quantitative errors on the models used in this experiment. Our method presents the lowest error compared to the other filters and preserves the details of the mesh (sharpness). These sharp details can be seen in Figure 16d.
Table 1. The quantitative errors of the models used in this experiment. The first column is the noise level. Columns 2, 3 and 4 are the results of the mean-squared error (MSE) [41] between the noisy models and filtered models.
Additionally, Figure 17 illustrates the correction of a mesh obtained using the DepthSense camera [23]. We first standardized the scale and sampling frequency in relation to the database, and then we corrected it using our filter. Figure 18 demonstrates the results obtained by our filter on different models. Column (a) illustrates the model with texture, column (b) shows the ground truth of the mesh obtained by a 3D scanner, (c) is the same mesh after noise, (d) shows models filtered by our method without texture and (e) illustrates the textured filtered models.
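The evaluation protocol can be illustrated with a minimal sketch that adds white Gaussian noise of several intensities to a depth map, filters it and reports the MSE. Everything in it is an assumption made for illustration: the synthetic planar depth map, the noise levels, and the 3 × 3 mean filter, which is only a stand-in and is neither the exemplar-based filter proposed here nor the Bilateral or Gaussian filters of the comparison. The numbers it produces are unrelated to Table 1.

```python
import numpy as np

def mse(a, b):
    """Mean-squared error between two depth maps of the same shape."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

def box_filter3(z):
    """3 x 3 mean filter with edge replication (stand-in smoothing filter)."""
    h, w = z.shape
    padded = np.pad(z, 1, mode="edge")
    acc = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / 9.0

rng = np.random.default_rng(0)  # fixed seed for repeatability
yy, xx = np.mgrid[0:64, 0:64]
clean = 0.05 * xx + 0.02 * yy   # hypothetical smooth depth map

results = {}
for sigma in (0.1, 0.5, 1.0):   # hypothetical noise intensities
    noisy = clean + rng.normal(0.0, sigma, size=clean.shape)
    filtered = box_filter3(noisy)
    results[sigma] = {
        "noisy_vs_filtered": mse(noisy, filtered),
        "clean_vs_noisy": mse(clean, noisy),
        "clean_vs_filtered": mse(clean, filtered),
    }
```

Reporting the error against the clean model alongside the error between noisy and filtered models helps to separate noise removal from loss of detail.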
The use of the Bosphorus Database [42] allowed us to use these models as a ground truth. We used 20 randomly chosen models as exemplars. Future work may involve determining the minimum number of exemplars that minimizes the filtering error, as well as reducing the number of neighborhoods per FFP region by removing intraclass redundancy.
Finally, our filtering approach was based on a division of the model into regions in which all points have an intrinsic geometric similarity. We presented how to define these regions for the specific case of faces with the usage of facial features in Section 3.2.2. A future research direction would be to define general descriptors that can be used for general-purpose filtering.
Figure 17. The correction of a mesh obtained using the DepthSense camera [43]. We first standardized the scale and sampling frequency in relation to the database, and then we corrected it using our filter. (a) shows the acquired noisy mesh, (b) is only the texture, (c) is the filtered mesh and (d) is the filtered mesh with texture.
Figure 18. The results obtained by our filter on different models. Column (a) illustrates the model with texture, column (b) shows the ground truth of the mesh obtained by a 3D scanner, (c) is the same mesh after noise, (d) shows the models filtered by our method without texture and (e) shows the textured filtered models.

Datasets
As mentioned before, the 3D model dataset used for the image synthesis was the Bosphorus Database [42]. This dataset allowed us to test and improve the last two steps of the pipeline: the rendering of new views and the printing phase.
As for the first two steps of the pipeline, the acquisition and the modeling (3D reconstruction and filtering), we used a dataset built in-house comprising 20 persons. For the filtering phase, we also used the Bosphorus dataset to provide the database of exemplars, as described previously. The printed card (Figures 24 and 25) clearly shows the three-dimensional visualization effect that was expected to be produced in the viewing of the cards.

System Evaluation and Discussion
The Card3DFace application offers a fast and affordable way to increase security in authentication cards and travel documents. It does not need highly sophisticated camera equipment for 3D image effect creation and does not use proprietary or sophisticated software packages, allowing for the easy implementation of this application. The results present themselves as reliable and adequate, providing a 3D effect of the face in lenticular cards through the laser printing techniques of the respective head views generated.
It is worth noting that the angle difference in the generated head views is relatively low, although it is enough to create the 3D visualization effect. This angle difference is usually up to ±20 degrees in relation to the frontal view of the face. Higher angle differences are not allowed due to two factors: on the one hand, the most important factor is the limitation of the viewing angle to keep the visibility of the card surface by a laser beam through the lenticular structure; on the other hand, the laser personalization machines themselves have physical limitations on the rotation angle of the card, which varies from machine to machine. Additionally, the 3D model of the head is also limited as it is reconstructed from a single frontal photo for practical reasons. Despite these limitations, the results demonstrated that a good 3D visualization effect is created in the printed cards, which validates the approach to this technology and its use to secure a person's authentication using ID and travel documents.
Furthermore, given the fast development and latest breakthroughs in the smartphone industry, whose devices are very likely to include time-of-flight technology in their cameras in the future (Huawei, Apple, Honor and others), it is relevant to reflect on the possibility of using this application on a smartphone as well. In future work, we should also consider improvements in adapting some techniques related to the removal of radial distortion in images.

Conclusions
In this paper, we presented an application that is capable of producing different views of a person's head and face based on an image from a single-shot acquisition. These views are meant to be printed on lenticular cards, thus providing a 3D visualization effect and a sense of depth of an individual's image.
In our view, this application is an important step toward achieving a higher level of security and improving the authentication capability of documents. By improving the 3D views of images on ID cards and travel documents, we are intrinsically improving authentication control.
The system requires a 3D camera (several technologies are included in this specification) that is able to output an image from which we can reconstruct a 3D model, and these cameras are now affordable and common. Considering the specific type of image effect intended (3D) and the type of card used for printing the subject image area, this technology also represents an obstacle for forgery, in addition to the printing technology, which in itself is much more complex.
Author Contributions: Conceptualization and methodology, L.D., L.C. and N.G.; software, L.D. and L.C.; investigation, data curation and writing-original draft preparation, L.D. and L.C.; writing-review and editing, L.D. and N.G.; and supervision, project administration and funding acquisition, N.G. All authors have read and agreed to the published version of the manuscript.

Funding:
The authors would like to thank the Portuguese Mint and Official Printing Office (INCM) and the University of Coimbra for the support of the project Card3DFace. This work has also been supported by Fundação para a Ciência e a Tecnologia (FCT) under the project UIDB/00048/2020.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement:
The study involved the use of human face images, which were taken from publicly available datasets and from in-house datasets of persons who authorized the use of their face images in the study and in the dissemination of the article.