Three-Dimensional Human Head Reconstruction Using Smartphone-Based Close-Range Video Photogrammetry

The creation of 3D head models from videos or pictures of the head using close-range photogrammetry techniques has many applications in clinical, commercial, industrial, artistic, and entertainment areas. This work aims to create a methodology for improving 3D head reconstruction, with a focus on using selfie videos as the data source. Using this methodology, we then propose changes to the general-purpose 3D reconstruction algorithm to improve the head reconstruction process. We define improvement of 3D head reconstruction as an increase in reconstruction quality (i.e., a reduction of head reconstruction errors and of the amount of semantic noise) and a reduction of computational load. We propose algorithm improvements that increase reconstruction quality by removing image backgrounds and by selecting diverse, high-quality frames. The algorithm modifications were evaluated on videos of a mannequin head. Evaluation results show that the baseline reconstruction is improved 12-fold due to the reduction of semantic noise and head reconstruction errors. The reduction of computational demand was achieved by reducing the number of frames to process, the number of image matches to perform, and the average number of feature points per image, while still providing the highest precision of head reconstruction.


Introduction
Three-dimensional modeling of the human head has a wide range of applications. Three-dimensional data of the head, with extension to the whole body, are widely used in clinical, industrial, anthropological, forensic, sports, commercial, and entertainment areas. Medical applications of 3D scanning may be divided into four groups: epidemiology, diagnosis, treatment, and monitoring [1,2]. The 3D measurements can benefit cranial deformation studies [3][4][5][6], diagnosis, craniofacial information analysis [7], and evaluation of the effects of orthotic helmets [8]. Models and 3D visualizations allow one to perform measurements for planning a surgical intervention, assess surgical outcomes, measure changes after surgeries, forecast the result of a facial plastic/cosmetic surgery, document clinical cases, compare pre-treatment and post-treatment models [9], perform more accurate orthodontic diagnoses [10], and achieve better dental reconstruction results [11]. In biomedical engineering, anthropometric measurements help to design prostheses [12] and allow for the rapid prototyping of customized prostheses. The manufacturing of medical products has to be based on population anthropometric studies so that medical equipment perfectly suits the physical characteristics of patients [13]. Head 3D modeling may be used for the documentation of research, registering EEG electrode positions [14][15][16], collection of anthropometric data [17][18][19][20][21], and defining normal head parameters [22]. Nonmedical fields of 3D head modeling applications include computer animation, movies, security, teleconferences, virtual reality, forensic identification [23], behavior research (perceptions of attractiveness), identification of human facial expressions, and sculpture [24].
Another large group of applications is found in industry: the design of headwear products, such as helmets, headgear, glasses, and headphones [25,26]; the optimization of wearable product comfort and function [27][28][29]; more ergonomic design of human spaces; the simulation of wearing clothes [30,31]; the creation of products that take ergonomics into account [27,32]; and the modeling and prediction of respirator size and fit [25,33,34].
There are several types of imaging techniques to create 3D models: laser line systems [35], structured light systems [36], close-range photogrammetry [37][38][39], and radiowave-based image capturing systems [40]. Image-based reconstruction and modeling of scenes [41][42][43], objects [44], and processes [45,46] is a widely accessible technique in terms of the price of gathering information [1,47,48]. The complexity of using such technologies mostly depends on the algorithms and user interface design. Three-dimensional objects may be reconstructed by fitting mathematical models to the collected image data [49,50]. However, such a model is required in advance, and it must be adequate to represent the range of variations the modeled object may possess. Therefore, the most popular technique for estimating three-dimensional structures from two-dimensional image sequences is Structure from Motion (SfM) [51][52][53][54]. The means of object modeling most easily accessible to ordinary users is based on handheld devices, such as smartphones [55,56]. Smartphone-based close-range digital photogrammetry would be the desired way of modeling objects at home. Photogrammetry using ordinary consumer-grade digital cameras can provide a cost-effective and sufficiently accurate solution for creating 3D models of the head, as new smartphones come equipped with ever higher quality cameras. The most common application of head modeling for home users could be the acquisition of head anthropometric data in order to select the appropriate size of headwear products. Another application could be trying on head apparel.
The construction process of the head 3D model for the home user must be fully automatic. The software tool is only allowed to give the user simple directions to correct their actions if they lead to a model of unsatisfactory quality. The simplest way for the user to collect a set of their head images would be to record a selfie video covering as many various views of their head as possible. Using a general-purpose 3D reconstruction algorithm, automatic reconstruction of the head may suffer from the non-static scene and various image photometric distortions.
This work proposes a methodology for the improvement of 3D head reconstruction, primarily from selfie videos, by increasing reconstruction quality and reducing the number of required computations.
The novelty and contributions of this work can be summarized as follows:
• Adaptation of a general-purpose 3D reconstruction algorithm to create head 3D point clouds from selfie videos;
• An increase in 3D head reconstruction quality achieved by the removal of background information and by the selection of a subset of the best-quality frames from the full set of frames;
• Presentation and comparison of methods for the selection of the highest-quality frames;
• Comparative evaluation of feature sources (layers of a convolutional neural network (CNN)) and dimensionality-reduction (DR) techniques used to order images by similarity in R² and R³ with the purpose of predicting the images' relative poses;
• Comparative results of the 3D head reconstruction improvements using mannequin head videos.
An overview of the general-purpose 3D reconstruction algorithm and its proposed modifications to improve the 3D head reconstruction process is presented in Figure 1.
The outline of the paper is as follows. In Section 2, materials and methods are described. In Section 3, computational experiments and their results and discussion are presented. Finally, Section 4 gives the conclusions of this work.

Materials and Methods
In this section, we describe the general-purpose 3D reconstruction algorithm and its shortcomings in using it for head reconstruction; we create a methodology for the improvement of 3D head reconstruction and use it to propose changes for the general-purpose 3D reconstruction algorithm; we present the rationale behind the proposed algorithm improvements and their implementation solutions; we describe the experimental data collection process, creation of head reference and test models; and outline the evaluation process of reconstruction algorithms.

Requirements for 3D Reconstruction Algorithm from Usability Viewpoint
Shortcomings of the general-purpose 3D reconstruction algorithm in head modeling arise from the specifics of how the initial data (most often a selfie video) are collected and from the kind of final reconstruction (model) we want to create. We aim to create a head model without semantic noise, i.e., a reconstructed scene that contains only the head as an object and no points belonging to non-head objects. Such a model would not require any automatic or manual postprocessing (which is not guaranteed to be accurate or successful), and it would also be more suitable for taking measurements and for visualization purposes. Moreover, we want to create a model with as few reconstruction errors as possible. Thus, we want the model to be of high quality, i.e., to have a low level of semantic noise and a low level of reconstruction errors.
Semantic noise will be present in the reconstructed scene because everything, not just the object of interest, is reconstructed. This noise interferes with measurements and disturbs visualization. It would be possible to edit or filter the point cloud, but this is a complicated task and does not guarantee a quality result. Another requirement for the data collection process, needed for the general-purpose 3D reconstruction algorithm to work properly, is that the scene must be static. However, when capturing one's own head (the most common case) or another person's head, it is not possible to ensure that everything in the scene stays fixed and does not move. Facial expressions during a 30-90 s recording can be controlled, but staying so still that the background does not change is practically impossible. During reconstruction, a changing background interferes with the reconstruction of the object: richer textures in the background may yield a more accurate reconstruction of the object's environment, but not of the object itself. A partial solution could be filming against a patternless, textureless background, but the user would need a background spanning almost entirely around them (such as a corner between walls of the same color), and such a place may be hard to find. Therefore, an easier solution is to remove the background from the photos so that it has no influence.
Another needed adjustment of the reconstruction algorithm is specialization for working with videos. It is more convenient to film one's head than to photograph it, especially if a person wants to image their own head. Making selfie videos with a smartphone is more convenient than taking many selfie photos because, during the shooting, a user only needs to keep the face as still as possible. Moreover, a user should not move the handheld camera too fast during filming, in order to minimize image distortions such as motion blur and rolling shutter. Slow camera movement during filming creates many similar frames, so it is not helpful to use all frames for the reconstruction. Due to the excessive number of repetitive images, the volume of calculations for the reconstruction increases significantly, but the accuracy practically does not improve. It would be helpful to detect frames carrying highly redundant information and remove them from the reconstruction process. Among the many frames, there will also be low-quality ones, where the face is slightly outside the frame or affected by motion blur due to a shaky hand. Such frames also need to be removed. Thus, the basic reconstruction algorithm has been supplemented with actions that remove unnecessary frames and, as a result, lower reconstruction errors.

Methodology for Improvement of 3D Head Reconstruction
Here, we propose a methodology for the improvement of 3D head reconstruction. We seek reconstruction improvement by increasing the reconstruction quality and reducing the number of required computations. We define model quality by the amount of semantic noise and reconstruction errors: the higher the level of noise and errors, the lower the quality of the model. The methodology is a list of possible solutions that systematically follow from the factors that negatively affect the reconstruction process and the quality of the head model.
We have summarized the factors that may negatively affect the reconstruction process and the quality of the reconstructed head model (discussed in Section 2.1.1):

1. Changing background: due to the movement of the head with respect to the background or the presence of other moving objects in the background;
2. Motion blur and rolling shutter distortion: due to low-light conditions, faster movement of the camera, or a shaky hand;
3. Defocus distortions: if the camera focuses on background objects;
4. Head out of frame limits: due to stumbles while making selfie videos;
5. Too many frames: due to an inefficient design of camera positioning around the head and, as a consequence, a long recording (an excess of redundant frames only slows down the reconstruction process).
These key modifications of the general-purpose 3D reconstruction algorithm should improve 3D head reconstruction from selfie videos by weakening factors that negatively affect the reconstruction process and quality of the model:

1. Elimination of the image background: suppresses the negative influence of the changing background on the reconstruction process; reduces the amount of semantic noise; and frees the algorithm from computations in the background region of the image;
2. Selection of the highest-quality frames: reduces reconstruction errors, because images with motion blur or defocus distortions and images where the head is out of frame limits are removed; reduces the number of redundant frames and thus the computational load; and the removal of redundant frames makes it acceptable to move the camera slowly while capturing, which in turn reduces motion blur and rolling shutter distortions.
Specifics of the implementation solutions of these modifications will be presented and discussed in Section 2.1.4.

Baseline Algorithm
The default Photogrammetry Pipeline from the AliceVision Meshroom software (version 2021.1.0) [57], with small adjustments, was used as the general-purpose 3D reconstruction algorithm, and it served as the baseline algorithm in the comparative evaluation.
The reasons that led to the choice of Meshroom were its functionality (features), popularity among users, acceptable reconstruction quality, open-source license, active development, the possibility to access and modify intermediate data, modular structure, and command-line interface. In order to evaluate the proposed modifications of the 3D reconstruction pipeline, a flexible environment for experimentation was needed. Meshroom provides a means to adapt the pipeline through its customizable workflow and/or by accessing intermediate data, and it is easy to intervene in the workflow with custom data processing steps. It is worth mentioning that a number of other photogrammetry packages exist, both free/open-source and commercial. Free/open-source applications for SfM [58] include COLMAP [59,60], OpenMVG [61], VisualSFM [62], Regard3D [63], OpenDroneMap (ODM) [64], MultiViewEnvironment (MVE) [65], and MicMac [66] [78].
The adjustments and their justification are as follows:
• Describer Types in the FeatureExtraction node were changed from sift to a combination of sift_upright and akaze_ocv. The first change was made because the camera is not rotated during the capture, and hence the feature orientation may be fixed; the second change adds more diverse features to increase matching robustness;
• In the FeatureMatching node, the parameters Cross Matching and Guided Matching were enabled to increase matching robustness;
• The default single StructureFromMotion node was changed to a sequence of two StructureFromMotion nodes with different settings. In the first StructureFromMotion node, the value of the parameter Min Input Track Length was changed from 2 to 3, and the value of the parameter Min Observation For Triangulation was changed from 2 to 4. In the second StructureFromMotion node, the parameter Lock Scene Previously Reconstructed was enabled, and the value of the parameter Min Observation For Triangulation was changed from 2 to 3. Such a setup increases the number of reconstructed cameras and reduces the noise in the point cloud;
• Only the sparse reconstruction part of the whole reconstruction pipeline was used, so the sparse point cloud from the last StructureFromMotion node was used as the test model in the evaluation.
This baseline algorithm, in the context of the generalized 3D reconstruction pipeline (Figure 1), consists of the steps: 1. Frame extraction from video; 2. Camera initialization; 6. Feature point detection; 8. Feature description; 9. Image matching; 10. Feature (descriptor) matching; 11. Structure from motion (sparse reconstruction). Formally, the 5. Frame selection step was also performed, albeit in a simple way: the large number of frames extracted from the video was reduced 3 to 4 times, depending on the initial frame count, so that the remaining frame count was near 400. The set of frames was reduced by taking every third or fourth frame. All selected frames from the videos were sent to the 3D reconstruction algorithm without any preprocessing. Any geometric distortions, for instance, due to camera optics, were corrected during the bundle adjustment process of the 11. Structure from motion step, when the extrinsic and intrinsic parameters of all cameras, together with the positions of all 3D points, are refined.
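The stride-based reduction described above can be sketched as follows (an illustrative Python sketch, not the authors' implementation; the target count of about 400 frames comes from the text):

```python
def select_every_nth(frame_ids, target_count=400):
    # Choose a stride (3 or 4 in the paper, depending on the initial frame
    # count) so that roughly target_count frames remain, then keep every
    # stride-th frame.
    stride = max(1, round(len(frame_ids) / target_count))
    return frame_ids[::stride]

frames = list(range(1500))        # e.g., a 50 s video captured at 30 fps
kept = select_every_nth(frames)
print(len(kept))                  # 375 frames remain (stride of 4)
```

Note that with this simple strategy the kept frames are evenly spaced in time but not necessarily of high quality; the later pipelines address that.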
The following are the essential steps of Meshroom's StructureFromMotion node [57,79], which implements an incremental algorithm and is concealed under the 11. Structure from motion step (Figure 1): 1. Fusion of all feature matches between image pairs into tracks; 2. Selection of the initial image pair and estimation of the fundamental matrix between these two images; 3. Triangulation of the feature points from the image pair; 4. Next best view selection; 5. Estimation of a new camera pose (a robust RANSAC framework is used to find the pose of the new camera, and nonlinear optimization is performed to refine the pose); 6. Triangulation of the new points; 7. Bundle adjustment to refine the positions of the 3D points and the extrinsic and intrinsic parameters of the reconstructed cameras; 8. Repetition of steps 4-7 until no new views are localized.
When introducing algorithm improvements according to the presented methodology (Section 2.1.2), adjustments presented here are kept.

Algorithms with Proposed Modifications
In the previous sections, we discussed the requirements for 3D head reconstruction algorithms from selfie videos from the usability viewpoint (Section 2.1.1). Later, the methodology for the improvement of 3D head reconstruction was proposed (Section 2.1.2). The methodology consists of key modifications of the general-purpose 3D reconstruction algorithm to improve 3D head reconstruction from selfie videos. Here, we introduce implementations of algorithm improvements according to the presented methodology.
All modifications are introduced gradually in order to be able to compare their influence on the reconstruction process. This resulted in three major branches of modified reconstruction algorithms and a total of six minor branches. A summary of the 3D head reconstruction algorithms explored in this work is presented in Table 1. The main modifications followed from the proposed methodology in Section 2.1.2, which specifies that elimination of the image background and selection of the highest-quality frames should be performed.

Table 1. Summary of 3D reconstruction algorithms tested: baseline 3D reconstruction algorithm (1) and its variants (2a, 2b, 3a, 3b, 4a, 4b). All variants introduce head detection and the discarding of feature points outside the bounding box of the head. Variants 3a, 3b, 4a, and 4b additionally utilize image quality during the frame number reduction but differ in the applied reduction strategy. 1 Reconstruction steps correspond to the order of the generalized 3D reconstruction pipeline in Figure 1; 2 frame selection strategies: "N-th" selects every N-th frame; "Best from N" selects the frame with the highest quality from N consecutive frames; "Best from all" selects a certain number of frames from the full set of frames exploiting image quality and image similarity information; 3 Describer Density preset in Meshroom's FeatureExtraction node.

Background Elimination
The first branch of the baseline algorithm is created by adding image background elimination and is labeled as Pipeline 2 with sub-branches {a|b} (Figure 1). The sub-branches differ in a single parameter value: the Describer Density parameter in the FeatureExtraction node is normal in variant (a) and high in variant (b). Background elimination is implemented as the 7. Initial feature point selection step following the 6. Feature point detection step of the generalized 3D reconstruction pipeline. The initial feature point selection (or elimination of unnecessary points) process requires information about the bounds of the main object, i.e., the head. This information is provided by the 3. Head detection step of the generalized 3D reconstruction pipeline. The background elimination is implemented through feature point selection because this was the most reasonable way to integrate the step with the Meshroom pipeline: simple masking of the background in the initial images would lead to spurious feature points along the edge of the background cutout.
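The idea of performing background elimination as feature point selection rather than image masking can be illustrated with a minimal sketch (hypothetical keypoint coordinates and bounding box; not Meshroom's code):

```python
def inside_bbox(point, bbox):
    # bbox is given as (xmin, ymin, xmax, ymax); point as (x, y).
    x, y = point
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1

def filter_features(points, head_bbox):
    # Keep only the feature points that lie inside the head bounding box;
    # points outside it belong to the background and are discarded.
    return [p for p in points if inside_bbox(p, head_bbox)]

points = [(120, 80), (400, 300), (900, 40)]   # hypothetical keypoint locations
head_bbox = (100, 50, 500, 450)               # hypothetical detector output
print(filter_features(points, head_bbox))     # [(120, 80), (400, 300)]
```

Because the filtering happens after feature detection, no artificial image edges are introduced, which is exactly the advantage over masking noted above.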

Head Detection
A convolutional neural network (CNN) single-shot detector (SSD) [80] is used for head detection [81] in the images (Figure 2). The model adopted in this research was developed by the authors of LAEO-Net [82]. The model's suitability for the task was evaluated by manually reviewing the head detection results on the collected dataset of 19 videos. The bounding box (BBox) that indicates the boundaries of the head in the image is used to remove the feature points that lie outside the boundary of the head. During head detection, not only are data on the location of the head in the image collected, but intermediate results from the convolutional feature layers of the CNN are gathered as well (Figure 3). Data from the feature layers are used as features to describe the image patch containing the head. Using these features, the frames can be grouped according to similarity. This grouping is exploited later in redundant frame dropping.
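The mapping from a detected BBox center to a (row, column) position in a feature layer, used later to pick the feature vector, might look as follows; the layer stride and input size here are illustrative assumptions, not the actual LAEO-Net/SSD configuration:

```python
def bbox_center(bbox):
    # bbox as (xmin, ymin, xmax, ymax) in input-image pixel coordinates.
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def center_to_cell(bbox, layer_stride):
    # A feature layer with stride s downsamples the input by a factor of s,
    # so the cell under an image point is found by integer division.
    cx, cy = bbox_center(bbox)
    return int(cy // layer_stride), int(cx // layer_stride)   # (row, col)

head_bbox = (96, 64, 352, 320)   # hypothetical detection in a 512 x 512 input
print(center_to_cell(head_bbox, layer_stride=16))   # (12, 14)
```

The feature vector describing the head patch would then be read from the chosen layer at this (row, column) location across all channels.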

Frames Selection Methods
Two goals may be achieved simultaneously by performing frame selection: removing redundant data to shrink the dataset and, as a result, reduce the computational load; and removing images that are of low quality due to motion and defocus blur. We implemented and tested two different methods for frame selection.
The first method is a straightforward extension of the simplest frame reduction strategy where every N-th frame is selected. The modification is made by integrating image quality estimates into the frame selection process. Image quality is estimated for every frame, and instead of selecting every N-th frame, the frame with the highest quality from N consecutive frames is selected. The image quality estimation method is presented below. The second branch of the baseline algorithm is created by adding the simplest frame reduction strategy together with the previously added image background elimination and is labeled as Pipeline 3 with sub-branches {a|b} (Table 1). This frame reduction strategy is implemented as 4. Image quality estimation and 5. Frame selection steps of the generalized 3D reconstruction pipeline ( Figure 1).
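The "Best from N" strategy can be sketched as follows (illustrative Python with hypothetical sharpness scores; the actual modifications were implemented in Matlab):

```python
def best_from_n(frames, sharpness, n):
    # From each window of n consecutive frames, keep the one with the
    # highest sharpness score instead of blindly taking every n-th frame.
    kept = []
    for start in range(0, len(frames), n):
        window = range(start, min(start + n, len(frames)))
        kept.append(max(window, key=lambda i: sharpness[i]))
    return [frames[i] for i in kept]

frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
sharpness = [0.2, 0.9, 0.4, 0.8, 0.1, 0.3]
print(best_from_n(frames, sharpness, n=3))   # ['f1', 'f3']
```

The output has the same size as the every-N-th selection, but blurred frames inside each window are avoided.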
The second frame selection method is more universal. It selects a predefined number of frames from a full set of frames, so the images may have come from an unordered image set-from a video with chaotic camera trajectories, from different videos, or collected as photographs. To achieve a satisfactory object 3D reconstruction result, we need images that are evenly spaced and cover a wide area around the object, and we need to additionally include a spacing term in the image quality estimate. The image quality estimation method in combination with spatial image ordering is presented below. The third branch of the baseline algorithm is created by adding the frame reduction strategy, which performs image ordering by similarity and later selects the best quality images in image groups, and is labeled as Pipeline 4 with sub-branches {a|b} (Table 1). This frame reduction strategy is implemented as 4. Image quality estimation and 5. Frame selection steps of the generalized 3D reconstruction pipeline (Figure 1).

Image Quality Estimation
The image sharpness metric was used as an estimate of image quality for the frame selection. This algorithm implements the 4. Image quality estimation step of the generalized 3D reconstruction pipeline ( Figure 1) when Pipeline 3{a|b} is selected (Table 1).
Key algorithm steps for image sharpness estimation:
1. Detect the head region defined by the BBox (it is already detected in the background removal step);
2. Calculate the Region of Interest (RoI) parameters: define the size of a square as the largest edge of the head BBox;
3. Crop the RoI part and resize it to a 256 × 256 px image patch;
4. Filter the patch and calculate the variance of the filtered patch;
5. A larger variance represents higher image sharpness.
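As an illustration, the sharpness score can be computed as the variance of a high-pass-filtered patch. The text does not specify the filter; the 3 × 3 Laplacian kernel used below is a common choice for this purpose and is our assumption:

```python
def laplacian_variance(patch):
    # patch: 2D list of grayscale intensities. Apply a 3x3 Laplacian
    # (4-neighbour) filter at every interior pixel, then return the
    # variance of the filter responses. Sharp patches, rich in edges,
    # produce large responses and hence a large variance.
    h, w = len(patch), len(patch[0])
    responses = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            lap = (patch[r-1][c] + patch[r+1][c] + patch[r][c-1]
                   + patch[r][c+1] - 4 * patch[r][c])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((v - mean) ** 2 for v in responses) / len(responses)

flat  = [[10] * 5 for _ in range(5)]                       # uniform (blurred-like) patch
edged = [[0]*5, [0]*5, [255]*5, [255]*5, [255]*5]          # strong horizontal edge
print(laplacian_variance(flat), laplacian_variance(edged))   # 0.0 43350.0
```

A blurred frame suppresses high-frequency content, so its filtered patch has a variance close to zero, which is why a larger variance indicates a sharper image.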

Frame Pose Prediction by Image Similarity Ordering
Frame pose prediction by ordering images according to similarity is a crucial step in creating a subset of images that covers a wide area around the object and contains evenly spaced images. Here, we define image similarity in terms of camera pose in 3D space: pictures or frames having similar poses will likely be similar if the scene is static. Image ordering by similarity is thus a proxy task for predicting the relative poses of the frames. Having relative poses, we can select the best-quality image from the image group corresponding to a predefined region of the surrounding space. To perform image ordering in 3D space, or in 2D space if we assume that the camera keeps an approximately constant distance from the head, we need 3D or 2D embeddings of the images.
To get image embeddings in 2D or 3D, a possible solution is to collect multidimensional feature vectors describing the head-containing images from the same CNN that was used to detect heads and then to reduce their dimensionality. The CNN model is trained to detect heads, so the features extracted by the network should serve as good descriptors of the head image patch. This is also the third task served by the same CNN model: head detection for initial feature point selection, RoI provision for image quality evaluation, and, here, feature extraction for image description. Feature vectors can be taken from any feature layer at any (row, column) position. The (row, column) position is determined from the results of the same network, namely the center of the detected head BBox (Figure 3). A suitable feature layer may be suggested by the size of the receptive fields of its units. The further a feature layer is from the input, the larger the receptive fields of its units. The size of the receptive field determines what part of the image the extracted feature vector describes. We want to compare, by similarity, only the image regions that semantically represent the head; intuitively, the size of the receptive field should therefore span the region of the head in the image. However, we will perform experiments to select the feature layer that provides the most helpful feature vectors (Table 2).

A particular feature vector (from a specific layer, at a certain (row, column) location) will mostly be shift invariant, but not scale or rotation invariant. Shift invariance is achieved by using the detected center of the head BBox to determine the (row, column) location of the feature vector. Rotation invariance is not as important because, during a short video capture, the camera is unlikely to undergo large tilt rotations.
Some degree of scale invariance would probably be needed for a selfie video made with an outstretched hand. If the video was made with a strongly changing distance from the camera to the head, or if we use frames from different videos, the scale of the head in separate frames may differ. This can lead to a situation where feature vectors describe the same object differently due to the change in the object's size. Scale invariance may be achieved by performing double-pass head detection: after the first run, the detected BBox is used to crop the image region with the head, and the cropped image is passed to the model for a second detection. We will perform experiments to check what changes in the image similarity ordering results may be achieved by adding a second pass.
The extracted feature vectors are multidimensional; the dimensionality of a feature vector is equal to the number of channels in the feature layer. In order to get image embeddings in 2D or 3D, we must reduce the dimensionality of the feature vectors. A set of dimensionality-reduction techniques will be compared in order to select the one that, combined with the selected type of feature vector, provides the best ordering of images by similarity. The goodness of the image order is measured by the percentage overlap of the two sets that contain the images closest to the target image. That is, for each image, we find the closest group of images in space (according to known image poses), and we find the closest (most similar) images according to the extracted feature vectors. The percentage overlap of these sets gives an estimate of the goodness of the image order. As the ground-truth poses of the images, we use the poses reconstructed using Pipeline 2b (Table 1).
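The overlap measure itself is simple; a sketch under the assumption that the two k-nearest-neighbour sets for a target image are already computed (one from ground-truth poses, one from feature-vector distances):

```python
def overlap_percent(pose_neighbors, feature_neighbors):
    # Both arguments: sets of image ids of equal size k, containing the k
    # nearest images by true camera pose and by feature-vector distance,
    # respectively. Returns the percentage of shared members.
    k = len(pose_neighbors)
    return 100.0 * len(pose_neighbors & feature_neighbors) / k

# Hypothetical neighbour sets for one target image (k = 4):
print(overlap_percent({1, 2, 3, 4}, {2, 3, 4, 9}))   # 75.0
```

Averaging this score over all target images gives a single number per (feature layer, DR technique) combination, which is how the candidates can be ranked.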
The following dimensionality-reduction techniques will be experimentally compared for suitability for image ordering by similarity. Implementations of the techniques were used from the Matlab Toolbox for Dimensionality Reduction (https://lvdmaaten.github.io/drtoolbox accessed on 9 August 2021) [83,84].
The best performing combination of the feature type and dimensionality-reduction technique will be used for frame selection in Pipeline 4.
Key algorithm steps for frame selection in Pipeline 4:
1. Extract feature vectors describing the regions of images that contain the head;
2. Perform dimensionality reduction using the selected technique;
3. Define a grid in the low-dimensional feature space that divides the space into uniform cells; the step size of the grid depends on the total number of frames we want to select (in this research, the target was 200 frames);
4. If several frames fall into the same cell, keep only the image with the largest sharpness.
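The gridding step can be sketched as follows (illustrative Python with hypothetical 2D embeddings and sharpness scores, not the authors' Matlab implementation):

```python
def grid_select(embeddings, sharpness, cell_size):
    # Bin each frame's low-dimensional embedding into a uniform grid cell
    # and keep only the sharpest frame per cell. The cell size controls
    # how many frames survive overall.
    best = {}   # cell -> index of sharpest frame seen so far
    for i, (x, y) in enumerate(embeddings):
        cell = (int(x // cell_size), int(y // cell_size))
        if cell not in best or sharpness[i] > sharpness[best[cell]]:
            best[cell] = i
    return sorted(best.values())

embeddings = [(0.1, 0.2), (0.3, 0.1), (1.2, 0.4), (1.4, 0.5)]
sharpness  = [0.5, 0.9, 0.3, 0.7]
print(grid_select(embeddings, sharpness, cell_size=1.0))   # [1, 3]
```

Because similar frames land in the same cell, this simultaneously enforces even spacing around the head and favors the sharpest representative of each view.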
Visualization of the frame selection process using gridding is presented in Figure 4. Results of experimental comparison of feature types, dimensionality-reduction techniques, single-pass vs. double-pass, and image embedding in 2D vs. 3D, are presented in Section 3.

Creation of the Head Models
Getting the evaluation results of reconstruction algorithms is based on the comparison of test and reference models. For objective evaluation, it is crucial to create a high-quality reference model. The creation of test models is directed by the algorithms we seek to compare. Therefore, the construction processes of reference and test models have differences. The reference and test models were constructed using the specifically collected data. The collection process of the video and photo data is described in Section 2.5.

Reference Model Creation
The goal of the reference model creation task is to reconstruct the mannequin's head with the highest precision. This 3D model should have the lowest level of semantic noise and the lowest level of reconstruction errors. Semantic noise (any points belonging to the non-head class) may be reduced by removing background information from the images. Possible reconstruction errors may be reduced by taking and selecting the highest-quality images. The creation of the reference head model is not constrained by time, tool selection, or any quota on manual work. After the photos were taken, they were manually edited to remove the background. The background was removed approximately, selecting as much of it as possible without damaging parts of the head. The photos were also reviewed to exclude poor-quality photos with poor focus and motion blur distortions. A total of 187 photos were selected for the reference model reconstruction. Three-dimensional photomodeling was performed using Meshroom software (version 2021.1.0) [57]. The default Meshroom photogrammetry pipeline with the default parameters was used, except that the Describer Density preset in the FeatureExtraction node was changed from normal to high, and the Describer Type was changed from sift to sift_upright, forcing the orientation of all features to be the same. The reconstructed reference head model with camera positions is shown in Figure 5. In the evaluation of the automatic reconstruction algorithms, the result of the final reconstruction step, i.e., the mesh (refer to Figure 1, step 17. Texturing of the reconstruction pipeline), is used.

Creation of Test Models
Test models are created according to the reconstruction algorithms we seek to compare. Here we use video data simulating selfie video scenarios. All frames from the videos are fed to the previously described 3D reconstruction algorithms without any preprocessing. Three-dimensional photomodeling was performed using Meshroom software in tandem with Matlab, which was used to implement the algorithm modifications. The settings of Meshroom and the algorithm modifications are described in Section 2.1. In the evaluation of the automatic reconstruction algorithms, the result of the Structure from Motion reconstruction step, i.e., the sparse point cloud (refer to Figure 1, step 11, Structure from Motion, of the reconstruction pipeline), is used.

Reconstruction Quality Evaluation
Three-dimensional head reconstruction algorithms were evaluated and compared in several tests. The most important results were gathered by comparing the created test models (sparse point clouds) to the reference model (mesh). Details on the model construction procedures can be found in Section 2.2. The comparison of the models was organized in two setups: comparing the distances between all closest points of the aligned models, and comparing the distances between the closest points of the aligned models only in the facial area of the head. The rationale for comparing all points is that it evaluates the overall quality of the model: it incorporates the influence of non-model parts on the evaluation results, includes semantic noise (objects from the background), and assesses the need for additional processing to clean the model. The rationale for comparing only the facial points is that it shows the algorithm's ability to reconstruct fine details of the head that are relatively stable, i.e., it excludes parts that may change between separate imaging runs. The shape of the hair region is easily distorted (distortions may be larger than face details but smaller than variations in the whole reconstructed scene), so only model points from the facial region are used in this comparison. Additionally, head shape, not necessarily including hair, is the right source of head size information for applications such as size selection of hats, helmets, glasses, or similar wearables. Points of the reference model were manually classified into facial and non-facial regions. During the model comparison, when the distances between the closest points of two models are computed, non-facial points of the reference model and the corresponding closest points of the test model are discarded.
In this research, the absolute scale of the models was not calculated. This is the consequence of using uncalibrated 2D images. Additional information is needed in order to estimate absolute scale [85]. Scale differences are eliminated during the alignment of test models to the reference model; therefore, comparative evaluation of the automatic reconstruction algorithms does not require scale information.
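The scale elimination during alignment can be illustrated with a similarity transform estimated between two point sets. The sketch below uses the Umeyama closed-form solution, which is one standard choice for this step, not necessarily the exact routine used in this work:

```python
import numpy as np

def similarity_transform(src, dst):
    """Estimate scale s, rotation R, translation t minimizing
    ||s * R @ x + t - y|| over paired 3D points (Umeyama, 1991).
    Scale differences between models are absorbed by s."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)          # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                          # avoid a reflection
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Applying `s * R @ p + t` to every test-model point then brings it into the reference model's frame regardless of the unknown absolute scale.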
The comparison procedure of the test and reference models, when all closest points of both models are used (Evaluation Case 1) and when only points in the facial area of the head are used (Evaluation Case 2), is as follows (all steps are common to both cases unless otherwise noted):

1. Detect three-dimensional facial feature points in the test and reference models (explained below in this Section and in Figure 6): (a) detect facial feature points in individual frames; (b) transfer the points from the images to the 3D model.
2. Estimate the parameters of the 3D geometric transformation between the two sets of 3D facial feature points, and apply the geometric transform to the test model to align it to the reference model.
3. Find the closest test and reference model points and the distances between them using the k-nearest neighbors algorithm.
4. (Only in Evaluation Case 2) Remove distances that involve points from the non-facial region of the reference head.
5. Evaluate the distances (as residual errors of model alignment) by applying statistical methods to find the mean and confidence intervals.
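Steps 3 to 5 of the procedure above can be sketched as a nearest-neighbor residual computation. The brute-force distance matrix, the facial-region filter, and the normal-approximation confidence interval below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def alignment_residuals(test_pts, ref_pts, ref_face_mask=None):
    """Distance from every test point to its nearest reference point.
    If ref_face_mask is given (Evaluation Case 2), distances whose
    nearest reference point is non-facial are discarded."""
    # pairwise distances; fine for a sketch, use a k-d tree at scale
    d = np.linalg.norm(test_pts[:, None, :] - ref_pts[None, :, :], axis=2)
    idx = d.argmin(axis=1)
    dists = d[np.arange(len(test_pts)), idx]
    if ref_face_mask is not None:
        dists = dists[ref_face_mask[idx]]
    mean = dists.mean()
    half = 1.96 * dists.std(ddof=1) / np.sqrt(len(dists))  # ~95% CI
    return mean, (mean - half, mean + half)
```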

Facial Feature Point Detection
Anatomical landmarks, in this research facial feature points, provide the means to perform various manipulations with the target object [21,[86][87][88][89]. In this research, facial feature points were used to align the test and reference 3D models. Facial feature points were chosen for model alignment because of the variety of the created test models: when the test point cloud contains a large number of spurious points and reconstructed points from background objects, point cloud alignment using the traditional iterative closest point algorithm will likely fail. Facial feature points may be detected in 2D images with high confidence. Additionally, faces are detected in multiple images, which leads to higher localization precision of the facial landmarks. Knowing the parameters of the reconstructed cameras, feature points may be transferred from the 2D images to the reconstructed 3D model. After transferring the landmarks to the 3D model, the multiple coordinates representing the same facial landmark are averaged after removing outliers. Facial feature points were detected in the images using the FaceLandmarkImg.exe tool from the facial behavior analysis toolkit OpenFace (version 2.2.0) (https://github.com/TadasBaltrusaitis/OpenFace, accessed on 9 August 2021). The description of the landmark detection algorithm may be found in [90,91]. Landmark detection was not performed in highly off-angle (profile) images. An example of the detected facial feature point locations on a 2D face and their locations on the 3D model is shown in Figure 6.
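The fusion of per-image landmark estimates into a single 3D point can be sketched as follows. The MAD-based outlier rule is a hypothetical choice, since the exact rejection criterion is not specified here:

```python
import numpy as np

def aggregate_landmark(candidates, z_thresh=2.5):
    """Fuse several per-image 3D estimates of one facial landmark:
    reject outliers with a robust median-absolute-deviation test
    (an assumed criterion), then average the inliers."""
    pts = np.asarray(candidates, dtype=float)
    med = np.median(pts, axis=0)                 # robust center
    d = np.linalg.norm(pts - med, axis=1)        # distance to center
    mad = np.median(d)
    keep = d <= z_thresh * (mad if mad > 0 else 1e-12)
    return pts[keep].mean(axis=0)
```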

Software Used
The software tools and programming languages we used in this research are:  [83,84].

Setup and Data Collection
The performance of the 3D head reconstruction improvements was tested on the mannequin head. Comparative evaluation of the algorithms requires a reference head model and test models. The capturing of the head was performed differently for the creation of the reference model and for the test models. Imaging of the mannequin head for the reference model was performed in such a setup that it would allow for the creation of a high-quality 3D model. Imaging setup for the test models was determined by the need to compare the performance and expose the properties of the 3D reconstruction algorithms while applying the algorithms in real-world scenarios.
Firstly, the mannequin head was prepared for capturing and photogrammetry by giving it a faint texture. Because the mannequin's skin was very smooth and even, lacking the pattern of real facial skin, the face of the mannequin was covered with faint glitter makeup. The presence of texture is necessary for the successful matching of image patches during reconstruction. The applied makeup can be observed in the images of Figure 5a.
The pictures of the mannequin for the construction of the 3D reference head model were taken using a Nikon D3200 digital SLR camera. The photographs were taken in an environment where the lighting of the mannequin was uniform and adequate. Shooting settings: image resolution was set to the maximum of 6016 × 4000 pixels, photo quality was set to maximal, flash was turned off, focal length was kept fixed at 18 mm, focal ratio f/3.5, exposure time 1/500 s. During all shooting, the mannequin's head was kept steady, without turning on its base, and the background was kept neutral and unchanged.
The videos for the creation of the test models were acquired using the standard Camera App of a Samsung Galaxy S10+ smartphone. For the comparative evaluation of the algorithms, 19 videos were taken. Acquisition conditions were varied between individual video recordings: the orientation of the smartphone, the lighting conditions, a stationary or varying background, frame rates of 30 or 24 frames/second at a frame size of 3840 × 2160 pixels, and mannequin makeup with more or less glitter. The average length of the videos was 51.5 ± 16.7 s. The movement pattern of the phone was the same for all videos: a zigzagging sideways movement while moving slowly from top to bottom, imitating an effort to make a selfie video that captures one's head from as many sides as the reach of a hand allows.

Results and Discussion
This work presents and evaluates a methodology for the improvement of the 3D head reconstruction process. The methodology was created keeping in mind that the 3D reconstruction algorithms are intended for creating head models from selfie videos, and that the models will most likely be used to make head measurements in order to select a suitable size of head wearables (hats, helmets, eyeglasses, etc.). This application requires exploiting and respecting the properties and constraints of such data. The adaptation of the algorithms to process this kind of data was the scope of this research.
Identified factors that may negatively affect the reconstruction process and quality of the reconstructed head model are as follows: changing background (non-static scene), motion blur, defocus and rolling shutter distortions, head out of frame limits, and excess of redundant frames, which only slows down the reconstruction process.
The primary avenues for improving 3D head reconstruction are increasing the reconstruction quality and reducing the number of required computations. The quality of reconstruction is defined by two components: reconstruction errors of the head and the amount of semantic noise. Thus, quality is improved by reducing both components. Semantic noise is reduced by minimizing non-head points in the reconstructed model, so that the reconstructed scene includes only head points (this is mainly reflected by the results of Evaluation Case 1). Reconstruction errors of the head are reduced by suppressing factors that deteriorate reconstruction precision (this is reflected by the results of Evaluation Case 2). The reduction of semantic noise leads to easier localization of the head feature points, where anchors may be attached for measurements; reduced reconstruction errors provide a more precise head model and thus more accurate and reliable measurements.
Semantic noise reduction is achieved by removing other objects (background information) from the initial head images. The reduction of reconstruction errors is achieved by increasing the quality of the images used for reconstruction, i.e., by selecting and using the highest quality images. Higher quality images (here, quality is mainly defined by the amount of motion blur and defocus) allow for a more precise reconstruction of the head.
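As an illustration of quality-based frame selection, a widely used proxy for defocus and motion blur is the variance of the image Laplacian; higher variance means more preserved high-frequency detail. This is only a plausible stand-in; the concrete quality measure used in the pipeline may differ:

```python
import numpy as np

def sharpness(gray):
    """Variance of the Laplacian of a grayscale image (2D float array).
    Blurred or defocused frames score lower."""
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())

def best_frame(frames):
    """Select the sharpest frame from a group of candidate frames."""
    return max(frames, key=sharpness)
```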
The reduction of computational demand is achieved by two measures: reducing the number of images used to reconstruct the model by discarding redundant frames, and reducing the number of features in the images (leaving only features related to the head).
In summary, the required key modifications of the general-purpose 3D reconstruction algorithm in order to improve 3D head reconstruction from selfie videos are the elimination of image background and selection of the highest quality frames.
The proposed modifications to the general-purpose 3D reconstruction algorithm were introduced gradually, and their influence on the reconstruction process was evaluated in reconstruction experiments. The gradual introduction resulted in three major branches of modified reconstruction algorithms and a total of six minor branches. Two basic experiments were designed and used to perform a comparative evaluation of the algorithms. One experiment evaluated the core components of Pipeline 4 (results in Tables 3 and 4). The second experiment evaluated the general-purpose 3D reconstruction algorithm and the three major modifications we proposed in the reconstruction of the head from selfie videos (results in Table 5).
For evaluation, experimental data were collected. The dataset consists of 19 test videos of the mannequin head and the reference head model. The reference head model was constructed from high-quality photographs with some manual input of the operator to increase the quality of the head model. For the test data, the head was captured in such a way that imitates selfie videos. The details of data collection are presented in Section 2.5. A summary of the common statistics about processed experimental data is presented in Table 6. Sparse point clouds of reconstructed heads by using all reconstruction pipelines discussed in the article are presented in supplementary Figure S1.
Comparative results of Pipelines 1 and 2 reveal the influence of image background elimination on 3D head reconstruction. The evaluation of Pipeline 3 shows the cumulative influence of an additional minor change: the selection of the best quality frames from several consecutive frames. The results of Pipeline 4 reveal the larger influence of selecting the highest quality frames from the full set of frames. The construction of Pipeline 4 required selecting a combination of the feature source (layer of the CNN) and the dimensionality-reduction (DR) technique used to order images by similarity. The latter comparison is performed in a separate experiment.
Pipeline 4 uses a more universal method to select images of the highest quality. If we have ordered images (i.e., frames with known poses in space), we can simply select the best image from each group of closest images; this approach is implemented in Pipeline 3. If the image pose is unknown, we first have to predict a probable relative pose, which can be done by ordering the images according to similarity. For the image similarity assessment, we used features from the CNN that was used to detect the head. The extracted feature vectors were used as descriptors of the image patch that holds the head. For image embedding in 2D or 3D (needed for the frame selection method), dimensionality reduction is required, as the image descriptors are multidimensional vectors. Dimensionality-reduction techniques were compared in combination with the feature type (source layer of the CNN). The results of the compared methods' potency to order images by similarity are presented in Table 3 (embedding in 2D) and Table 4 (embedding in 3D). The values in the tables represent the portion of correctly predicted images being the most similar to the reference image. A score of 100 would show that the method correctly predicts the full image group closest to the reference image, where the reference closeness is calculated from the known image poses. The best performing combination of feature type and dimensionality-reduction technique is taking features from the 14th convolutional layer Conv 6 (fc6) with the t-SNE DR technique. This holds in both the 2D and 3D cases and in both head detection cases, single pass and double pass. Comparing the feature types and DR techniques separately, the findings are the same: features from the 14th convolutional layer and the t-SNE DR technique perform best. Among the DR techniques, the second-best result is obtained by performing no dimensionality reduction. The second-best feature source depends on the head detection strategy (one pass or two passes).
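The score described above can be sketched as the overlap between the embedding-based and pose-based nearest-frame sets. The helper below assumes precomputed embedding coordinates and camera-pose vectors and is illustrative only:

```python
import numpy as np

def rank_by_distance(points, ref_idx):
    """Indices of points ordered by distance to the reference point."""
    d = np.linalg.norm(points - points[ref_idx], axis=1)
    return np.argsort(d)

def similarity_score(embedded, poses, ref_idx, k=10):
    """Percent of the k embedding-nearest frames that are also among
    the k pose-nearest frames (a sketch of the Tables 3/4 score).
    Index 0 of each ranking is the reference itself, so it is skipped."""
    pred = set(rank_by_distance(embedded, ref_idx)[1:k + 1])
    true = set(rank_by_distance(poses, ref_idx)[1:k + 1])
    return 100.0 * len(pred & true) / k
```

A perfect embedding reproduces the pose-based neighborhood exactly and scores 100.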
Comparing the head detection strategies, the results show that the two-pass strategy helps increase the usefulness of features from the later convolutional layers (starting from the 17th). The receptive fields of units from these layers are larger and the feature layers themselves are smaller, so variations of head positioning within the receptive field are corrected more by head detection than by selecting a feature vector from a suitable (row, column) location. Averaged over all feature layers, the two-pass strategy systematically increases the scores. Image embedding in 3D provides slightly better scores than embedding in 2D.

Table 3. Comparison of feature sources (convolutional layer number) and dimensionality-reduction techniques used to order images by similarity (closeness) in R 2 . The values in the table are scores (mean ± SD), which represent the number of correctly predicted images (in percent) being the most similar to the reference image. The best performance is highlighted in red.

The results of the main comparative evaluation of the general-purpose 3D reconstruction (as a baseline, Pipeline 1) and the modifications we propose (Pipelines 2-4) are presented in Table 5. There are two evaluation cases: Evaluation Case 1 measures the 3D head reconstruction residual errors in the entire region of the head, and Evaluation Case 2 calculates residuals only in the facial region of the head. The results show that the lowest averages of residuals over all videos are provided by Pipeline 4 (12 times smaller in Evaluation Case 1 and 7 times smaller in Evaluation Case 2, compared to Pipeline 1). Comparing the submodifications a and b of the algorithms, in case b, when more features are detected in the images, the residuals are slightly lower. The influence of semantic noise on the quality of the reconstructed head model is mainly reflected by the results of Evaluation Case 1, where residuals are calculated over the full area of the reference head.
Evaluation Case 2 tells more about the model reconstruction errors (the precision of the head reconstruction), which may be influenced by the quality of the images. The tendencies of the residual changes provided by the different algorithms are consistent between the two cases. If the scene being reconstructed is not static, the baseline reconstruction process may fail, as happened in the Video 10 case because of a moving background. Summarizing the data from Table 6: on average, Pipelines 1-3 used 412 frames and required over 12,000 image pairs to compare, whereas Pipeline 4 used on average 202 frames and required two times fewer image pairs. The number of features to match mainly depended on the submodification of the pipeline: in case b, more than twice as many feature points were detected in the images. The baseline case used a large number of features because features in the background region were not discarded. The evaluation of the reduction of computational demand shows that introducing all proposed improvements, compared to the baseline algorithm, reduced the number of frames to process by two times, reduced the number of image matches to perform by six times, and reduced the average number of feature points in images by 1.4 times, while the point count in the facial area of the head point cloud was reduced by only 1.2 times and the highest precision of the head reconstruction was still provided.
The proposed photogrammetry algorithm improvements are highly adapted for head reconstruction. To extend the usage of the algorithm for the reconstruction of other objects, the head detector should be replaced with the detector of the target object. Object detectors trained to recognize multiple object classes may prove useful in this case. The proposed algorithms would be least adaptable for reconstruction applications beyond close-range photogrammetry. Background masking would not be applicable to aerial cases.
There is still room for additional improvements to the proposed algorithms. The initial feature selection step could be upgraded to mask the background more precisely. The current head detector returns a bounding box that is used to select useful feature points, but the bounding box does not follow the head contours. A face detection and segmentation model that provides a head segmentation mask would allow the removal of more feature points belonging to the background region. Another upgradable operation is image subset selection using a regular grid after applying the t-SNE dimensionality-reduction technique to order images by similarity. Because of the meaning of distances in t-SNE embeddings, a regular grid for partitioning the space in order to group similar images may not be the optimal approach. More sophisticated methods could better exploit the advantages of the t-SNE technique in finding the closest images. The results of the image ordering experiments (Tables 3 and 4) show that the relative distances between images provided by t-SNE carry the largest amount of information about the real distances between the poses of the views. In this research, the absolute scale of the models was not estimated. Determination of the absolute scale can be added in future research to create 3D models for absolute measurements.
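The regular-grid subset selection discussed above can be sketched as follows. This is a simplified illustration of the current approach; the `n_bins` value and the one-frame-per-cell rule are assumptions:

```python
import numpy as np

def grid_subset(embedded, n_bins=8):
    """Keep one frame per occupied cell of a regular grid laid over
    the 2D embedding of frame descriptors (the subset-selection step
    the text suggests could be replaced by a method better suited to
    t-SNE distances)."""
    lo, hi = embedded.min(axis=0), embedded.max(axis=0)
    cells = np.floor((embedded - lo) / (hi - lo + 1e-12) * n_bins)
    cells = np.clip(cells, 0, n_bins - 1).astype(int)
    seen, keep = set(), []
    for i, cell in enumerate(map(tuple, cells)):
        if cell not in seen:       # first frame in this cell wins
            seen.add(cell)
            keep.append(i)
    return keep
```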
Among the limitations of the presented experiment is that only a mannequin and a single head were used for the experimental data collection. Larger testing scenarios with more mannequins and even real people are of interest. An interesting question is how the proposed algorithms would work on real faces. Insights made during the preparation and execution of the current experiment suggest that, in the case of reconstructing real faces, some factors would help improve the model's quality (lower errors), while others may impair the automatic reconstruction process. Real faces could be easier to reconstruct, as they have various additional patterns that help to extract more distinct features in the face area; this improves the determination of corresponding areas. Sources of reconstruction difficulties would reside in the capturing of real faces with a camera. Involuntary face movements, small or large, are inevitable. Some of them can be tolerated to a certain level, but larger movements need to be detected and eliminated. The solution could be to automatically detect face movements that cannot be tolerated (if they have too large a negative impact on the model's quality) and exclude a subset of the samples, or the user could be asked to recapture their face. The modifications of the algorithms were designed keeping in mind that capturing one's own head is most convenient by making selfie videos with a smartphone. During a short filming period, the face can be kept sufficiently still.
Another subject of improvement in further experiments could be the construction of a reference head model. A better reference model could be created using high-accuracy 3D scanners. This will be necessary in case absolute measurements need to be compared. In the current research, our approach to create the reference model using the photogrammetry pipeline without the proposed modifications that need to be tested does not prevent relative comparison of the models and evaluation of the modifications to reveal their relative influence. The quality of the reference model is maximized by using high-quality manually revised images.

Conclusions
This work proposes a methodology for the improvement of 3D head reconstruction. The primary application of these 3D reconstruction algorithms is to create one's head model using selfie video as the data source, so the improvements of the algorithms are directed and somewhat constrained by the origin of the data. The adaptation of the algorithms to process this type of data is the scope of this research.
The evaluation of the 3D head reconstruction improvements was performed using 19 videos of a mannequin head. Reconstruction quality depends on the amount of semantic noise and the reconstruction errors of the head. The influence of semantic noise on the quality of the head model is mainly reflected by the results of Evaluation Case 1, where residuals are calculated over the entire area of the reference head. These results show that the baseline algorithm is improved 12 times by introducing all improvements: elimination of the features in the background and selection of a subset of the best quality frames from the complete set of frames. The same modifications of the algorithm produced the largest improvement (nearly seven times) in Evaluation Case 2, where residuals are calculated only in the face area of the reference head. The latter experimental case reflects the precision of the head reconstruction.
The selection of a subset of the best quality images is based on image ordering by similarity. Comparative evaluation of feature sources (layer of the CNN) and dimensionality-reduction techniques used to order images by similarity showed that using t-Distributed Stochastic Neighbor Embedding (t-SNE) in combination with features from the 14th convolutional layer (out of 25) of the CNN to order images by similarity in R 3 provides the largest number of correctly predicted images (75%) being the closest to the reference image. A comparison of the single-step and two-step head detection approaches showed that in this case (the combination of 14th-layer features and t-SNE) the approaches perform similarly.
The evaluation of the reduction of computational demand shows that introducing all the proposed improvements, compared to the baseline algorithm, reduced the number of frames to process by two times, reduced the number of image matches to perform by six times, and reduced the average number of feature points in images by 1.4 times, while the point count in the facial area of the head point cloud was reduced by only 1.2 times and the highest precision of the head reconstruction was still provided.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/app12010229/s1, Figure S1: Sparse point clouds of reconstructed heads by using all reconstruction pipelines discussed in the article.

Data Availability Statement:
The data presented in this study are available upon reasonable request from the corresponding author. The data are not publicly available due to privacy issues.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: