1. Introduction
Dentofacial deformities interfere with the quality of life, both psychologically and physically. Patients with dentofacial deformities require a meticulous assessment of the dentoalveolar position, facial skeleton, facial soft tissue surface, and their interdependency. The treatment plan must correct this triad in order to achieve an aesthetic and stable result with adequate function [1, 2]. The treatment plan is dependent on an in-depth physical examination, cephalometric analysis, patient age, and severity of the malocclusion. In complex cases, a combination of orthodontic treatment and orthognathic surgery is required [3, 4, 5].
A systematic review found that three-dimensional models are accurate tools that allow clinicians to identify and locate the source of the deformity as well as its severity. Moreover, 3D models are realistic tools for treatment planning and assessment of outcomes. Since these models can be manipulated in any direction, avoiding patient recall and decreasing the time of the appointment, the treatment simulation and pre- and post-treatment comparisons are easier, allowing the clinicians to assess facial changes following orthodontic and/or surgical treatments. Despite the progress in 3D imaging, the available imaging techniques still do not allow the representation of the three elements of the face (skeleton, soft tissue, and dentoalveolar) at the same time with optimal quality [6].
A correct diagnosis and surgical plan are fundamental to obtaining improved therapeutic results. To achieve this, the clinician can use a three-dimensional computerized virtual preoperative simulation using cone beam computed tomography (CBCT) [7, 8]. CBCT is a medical imaging technique that uses a cone-shaped radiation beam that rotates 180° to 360° around the patient [9], allowing the images to be reconstructed three-dimensionally [10]. It is currently an important tool in diagnosis, planning and postoperative follow-up of complex maxillofacial surgery cases [11]. It further complements conventional two-dimensional X-ray [12] imaging techniques, given that these have some limitations [13, 14]. Moreover, CBCT allows for detailed planning, not only by the creation of a virtual 3D craniofacial model, but also by the virtual simulation of the planned surgery. This process is based on a simplified anatomical virtual model with hard, soft, and dental tissues. The obtained model facilitates the construction of surgical splints using CAD/CAM techniques, thus rendering manually constructed splints unnecessary [15]. CBCT reconstruction with a digital dental cast is the most accurate model to visualize the facial dentition and skeleton. However, CBCT skin is untextured and, to allow a more realistic visualization of the 3D facial model, a superimposition of the textured facial soft tissue surface (e.g., 2D photographs) is required. This process is easy and cheap, and uses a specific algorithm and software obtainable through surface-based registration [6].
It is possible to obtain 3D models by reconstruction of CBCT images using techniques such as multiplanar rendering, surface rendering (which includes contour-based surface reconstruction and isosurface extraction based on marching cubes), and volume rendering [16]. Multiplanar rendering is a technique that displays intensity values on arbitrary cross-sections using volumetric data. The volume is projected onto three orthogonal spatial planes (coronal, sagittal and axial), allowing the user to navigate while changing the coordinates of the three planes [16]. Surface rendering allows for volume quantification through datasets that have common values, using voxels, polygons, line segments or even points. The marching cubes algorithm defines 15 basic intersection patterns, resulting in a detailed construction of a desired surface [16, 17, 18]. Volume rendering allows the visualization of the 3D model from any direction, making it possible to obtain a high-quality analysis of the various layers. This enhanced viewing experience thus greatly facilitates the interpretation of the data. Among the various techniques described, volume rendering presents the best image quality but requires greater computational power [16].
The most critical aspect of virtual surgery planning is currently focused on the soft facial tissue and on the outcome of future bone and tooth movement. However, the combination of 2D photographic images with 3D models generated from CBCT images has not yet been sufficiently explored. Recently, significant progress has been achieved regarding the 3D reconstruction of facial models from a single portrait image. According to current scientific evidence, it has been concluded that this method has a superior performance compared to the others, since it requires a smaller number of resources (one portrait photograph) and presents greater surface detail [19]. However, CBCT reconstruction may include artefacts caused by restorations or orthodontic appliances or defects in the clear representation of facial patient features (e.g., face color), which affect the optimal representation of the 3D facial image. Therefore, new algorithms must be tested in order to obtain a more realistic and accurate 3D facial model that allows the performance of several simulations with different osteotomies and skeletal movements. The aim of this study was to assess the possibility of external facial reconstruction using CBCT medical image volumes and patient portrait photographs.
  2. Materials and Methods
  2.1. Study Design
The following study was approved by the Ethics Committee of the Faculty of Medicine of the University of Coimbra (Reference: CE-039/2020) and was conducted in accordance with the Declaration of Helsinki. All participants signed a written informed consent for their participation.
  2.2. Data Collection Procedure
CT images of all participants were collected from the database of patients who underwent CBCT of the skull at the Medical Imaging Department of the Coimbra Hospital and University Centre between January 2015 and February 2021.
Patients included in the study met the following inclusion criteria: patients with skeletal class I, II and III dentofacial deformities in need of orthodontic-surgical treatment; Caucasian patients aged over 18 years; and patients with CBCT and photographic records acquired before and after orthognathic surgery. The exclusion criteria were: patients with congenital abnormalities or syndromes with craniofacial deformities; patients with previous head and neck trauma; patients with multiple missing teeth, untreated caries lesions, or active periodontal disease; and patients with a previous history of orthodontic treatment.
All CBCT image information was exported in DICOM format, which represents a set of standards that ensure the safe exchange and storage of radiological images. Photographs were exported in JPEG format. The CBCT scanner was an i-CAT machine with a voxel dimension of 0.3 mm and an acquisition time window of 13.4 s. The machine was calibrated for all data acquisitions. For the photographic records, an RGB digital sensor camera was used, and the same model and focal length were used for all extraoral photographs. The exposure of the patients’ faces raises ethical issues; however, showing them is essential for the development of this study.
Figure 1 presents a flowchart of the methods used to co-register the CBCT and photographic images and, consequently, to reconstruct a realistic facial image.
  2.3. CBCT Volume
  2.3.1. Rendering
The software chosen for processing and exporting the CBCT volumes was MATLAB R2020b. The images can be visualized using the Volume Viewer application, which can render them using several methods. Volume rendering was preferred because, as noted in the Introduction, it presents the best image quality.
  2.3.2. Thresholding and Skin Segmentation
Data acquired by CBCT commonly present some degree of noise as well as artefacts. In order to manage the range of voxel values, a thresholding technique was used. Through the aforementioned technique, the image intensity value corresponding to air and noise from different sources was determined. The main area of interest for this study was the skin surface, so smoothing and correcting imperfections represented a considerable challenge, as CBCT shows low contrast for soft tissues.
The threshold value was determined from the histogram obtained from a scan chosen from our database (Figure 2). In the histogram under consideration (logarithmic scale), an absolute minimum can be observed within the range of negative values, and it was verified that this value could be used to distinguish useful information from noise and air. There is no drawback to preserving the data for hard tissue or for metallic materials, such as prostheses, that may be present.
After defining the range of voxels, the image was binarized to ensure that all voxel values above the threshold (zone of interest) were equal to 1 and the remaining were equal to 0. Therefore, the image of the patient’s head becomes a single cohesive volume, facilitating the visualization and data processing.
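The thresholding and binarization steps can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' MATLAB implementation. The function names, the bin count, the synthetic volume, and the assumption that air and noise form the dominant peak among negative intensities are all hypothetical choices made for the example.

```python
import numpy as np

def histogram_valley_threshold(volume, bins=256):
    """Return the intensity at the histogram minimum between the air peak and zero.

    Assumption: air and noise form the dominant peak among negative intensities.
    """
    counts, edges = np.histogram(volume, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    negative = np.flatnonzero(centers < 0)         # bins in the negative range
    peak = negative[np.argmax(counts[negative])]   # assumed air/noise peak
    valley = peak + np.argmin(counts[peak:negative[-1] + 1])
    return centers[valley]

def binarize(volume, threshold):
    """Set voxels above the threshold to 1 and all others to 0."""
    return (volume > threshold).astype(np.uint8)

# Synthetic stand-in for a CBCT volume (illustrative values only).
rng = np.random.default_rng(0)
volume = rng.uniform(-1020.0, -980.0, size=(20, 20, 20))                 # "air"
volume[5:15, 5:15, 5:15] = rng.uniform(-50.0, 150.0, size=(10, 10, 10))  # "head"

threshold = histogram_valley_threshold(volume)
mask = binarize(volume, threshold)
```

In this toy volume the mask keeps the central block and discards the surrounding air. On real data the valley would be read from the scan's own histogram, as described above.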
  2.3.3. Image Processing Techniques
To achieve an adjustable high-definition 3D model of the patient’s head, a 3D Gaussian smoothing filter with a variable standard deviation value was used. Despite being an effective filter, it cannot fill gaps or voids, nor can it exclude single separated voxels from the patient model. To overcome this limitation, closing, erosion, and filling operations were applied to the smoothed images to obtain a version without holes or separated voxels.
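The sequence of smoothing and morphological operations could look roughly like the sketch below. It uses `scipy.ndimage` as a stand-in for the MATLAB functions used in the study. The function name, the default structuring element, the re-binarization at 0.5 after smoothing, and the parameter values are assumptions made for illustration.

```python
import numpy as np
from scipy import ndimage

def clean_head_mask(mask, sigma=1.0, iterations=1):
    """Smooth a binary volume, then close, erode, and fill it.

    Assumption: the smoothed volume is re-binarized at 0.5 before the
    morphological steps; the source does not state this detail.
    """
    smoothed = ndimage.gaussian_filter(mask.astype(float), sigma=sigma) > 0.5
    closed = ndimage.binary_closing(smoothed, iterations=iterations)  # bridge small gaps
    eroded = ndimage.binary_erosion(closed, iterations=iterations)    # trim thin residue
    return ndimage.binary_fill_holes(eroded)                          # fill enclosed voids

# Toy volume: a solid block with an internal cavity and one stray voxel.
mask = np.zeros((20, 20, 20), dtype=bool)
mask[4:16, 4:16, 4:16] = True
mask[8:12, 8:12, 8:12] = False   # internal void
mask[1, 1, 1] = True             # isolated voxel

cleaned = clean_head_mask(mask)
```

In this toy case the cavity ends up filled and the stray voxel is removed. The standard deviation and the number of iterations would need tuning per scan.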
  2.3.4. Cinematic Animation of the 3D Craniofacial Volume
After achieving the 3D volume of the face, it is necessary to capture and export the craniofacial model at various positions to establish a number of frames in order to create a three-dimensional cinematic animation.
After establishing the rotation trajectory around the craniofacial model and taking into consideration the camera position in relation to the volume, spatial transforms were used to draw the desired trajectory along the three space axes.
In addition to defining a trajectory, it is necessary to capture each of the frames and export them in PNG format. It is important to point out that the rotation angle amplitude, the number of frames and the volume and background colors are parameters that can be adjusted in order to obtain the best possible representation of the model (Figure 3).
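As an illustration of how such a trajectory might be parameterized, the sketch below rotates a starting camera position about a vertical axis through the volume center. The axis choice, the function name, and the numeric values are hypothetical. The study itself used spatial transforms in MATLAB, and the rendering and PNG export steps are not shown.

```python
import numpy as np

def camera_trajectory(start, amplitude_deg, n_frames):
    """Rotate a camera position about the z-axis through the origin.

    Returns one position per frame, spanning -amplitude/2 to +amplitude/2.
    """
    angles = np.deg2rad(np.linspace(-amplitude_deg / 2, amplitude_deg / 2, n_frames))
    positions = []
    for angle in angles:
        c, s = np.cos(angle), np.sin(angle)
        rotation_z = np.array([[c, -s, 0.0],
                               [s,  c, 0.0],
                               [0.0, 0.0, 1.0]])
        positions.append(rotation_z @ start)
    return np.array(positions)

# Hypothetical setup: camera in front of the face, 60 degree sweep, 13 frames.
start = np.array([0.0, -500.0, 0.0])
positions = camera_trajectory(start, amplitude_deg=60.0, n_frames=13)
```

Each row of `positions` would be passed to the renderer as the camera position for one frame. The distance to the volume center stays constant along the sweep.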
  2.4. Photographic Image
Unlike 3D volumes, photographic images lack information regarding depth; therefore, face-recognition and face-swapping methods were used to overcome this issue. These techniques are based on fiducial points, thus allowing the information of a 2D image to be mapped, even without the depth data of the facial points. Based on the information of the fiducial marks, a Python routine was implemented using several packages, such as OpenCV, NumPy and Dlib, that allowed the 2D images to be mapped to each other.
  2.4.1. Face Recognition
Some face recognition algorithms identify features through reference points from an image of the face. They can be classified under two categories: geometric, which analyzes distinct features; and photometric, a statistical approach that breaks down an image into figures and compares them with reference models. In this case, these features were used to identify the input as a human face and to extract the facial landmarks and fiducial points of each patient, to later enable face-swapping. At this stage, the processed CBCT volume is the target subject, and the portrait photograph is the source.
The fiducial point detection algorithm identifies these points using existing models, which may be based on different sets of points and may have been obtained through a wide variety of computer vision techniques, such as machine learning or neural networks. Consequently, there was no need to train a face detection model, which simplified the progress of this project.
Considering these aspects, a model was used that allowed for the identification of 68 characteristic points of the human face. This model was trained with the ibug 300-W database, a dataset consisting of 300 images of in-the-wild faces in an indoor environment and 300 faces in an outdoor environment [20].
  2.4.2. Delaunay Triangulation
To implement the process of face swapping, it is necessary that both faces are divided and processed geometrically, as it is not possible to just “swap” one face for the other given that faces vary in size and perspective. However, if the face is sectioned into small triangles, the homologous triangles can be simply swapped, keeping the proportions, and adapting the facial expressions, such as a smile, open mouth or closed eyes.
Prior to the triangulation, it is necessary to create a mask that covers all 68 points. To do so, an operation to obtain the optimal perimeter is applied using the most external points of the face image.
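One plausible reading of this step is that the mask boundary is the convex hull of the landmark points. The source does not name the operation, so this is an assumption. The sketch below computes a convex hull with Andrew's monotone chain algorithm in plain Python. The function name and the sample points are illustrative only.

```python
def convex_hull(points):
    """Return the convex hull of 2D points, counter-clockwise in x-right, y-up coordinates."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]

# Toy landmark set: four outer points and two interior points.
landmarks = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
hull = convex_hull(landmarks)
```

Only the outermost points remain in `hull`. Filling the resulting polygon would give a mask covering every landmark.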
Regarding the approach to facial point triangulation, Delaunay triangulation was applied [21]. This algorithm maximizes the minimum angle among all the triangles in the triangulation. Hence, it is possible to obtain a matrix that presents three sets of coordinates for the vertices of each triangle in two dimensions, resulting in the subdivision of the 2D faces into triangles.
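A small example of the triangulation output is given below. Here `scipy.spatial.Delaunay` stands in for whichever implementation was used in the study, which the source does not specify. The five points are a toy substitute for the 68 landmarks.

```python
import numpy as np
from scipy.spatial import Delaunay

# Toy landmark set: four corners of a square and its center.
points = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 4.0], [0.0, 4.0], [2.0, 2.0]])

triangulation = Delaunay(points)
triangle_indices = triangulation.simplices    # shape (T, 3): landmark indices per triangle
triangle_coords = points[triangle_indices]    # shape (T, 3, 2): vertex coordinates per triangle
```

`triangle_coords` corresponds to the matrix of vertex coordinates described above. `triangle_indices` identifies the same triangles by landmark index, which is the form needed when the same triangle must be found on a second face.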
  2.5. Fusion of 2D and 3D Modality
To achieve the aim of this work, it was necessary to merge the data from the 3D volume with the data obtained from the processing of the 2D images.
The DICOM volume has all the information regarding the different perspectives and dimensions of the patient’s face; however, it lacks the photographic aspect. On the other hand, the photograph has all the colorimetric data, such as skin tone, facial hair, and eye color.
To solve this problem, the colorimetric data were extracted from the patient’s photograph, based on the Delaunay mapping and triangulation technique. In this way, it was possible to achieve a photorealistic 3D animation of the patient.
  2.5.1. Face Swapping
In the face-swapping technique, the image of the target subject’s face must be defined, which in this study was each of the generated CBCT volume frames. The segments of the face were transposed using a single portrait photograph of the patient. The segments were transposed in the form of triangles that form homologous pairs with the target face triangles. Each of the triangles had three vertices that correspond to three of the 68 fiducial points, which facilitated information matching between the two face mappings.
On the source face from the patient’s portrait photograph, Delaunay’s triangulation was applied. However, on the target face, a different approach was used. Since 68 fiducial points were obtained for both the target and source subjects, the triangles of the target face were calculated according to the triangles of the source face. This meant that a triangle on the source face, composed, for example, of indices 1, 2 and 37 of the fiducial point map, corresponds to a homologous triangle on the target face, whose indices are also 1, 2 and 37. Therefore, a triangle is characterized and identified on both faces by the indices to which the coordinates of its vertices correspond. Although the indices are equal, the area and perspective of the triangle may vary from one face to the other, making the face-swapping process demanding.
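The index-based correspondence can be sketched as follows, assuming the source triangulation is available as vertex coordinates. The function name and the four-point toy landmark sets are hypothetical. Real data would use the 68-point maps of both faces.

```python
import numpy as np

def triangles_to_indices(triangle_coords, landmarks):
    """Map each triangle vertex (T, 3, 2) to the index of the nearest landmark (N, 2)."""
    diffs = triangle_coords[:, :, None, :] - landmarks[None, None, :, :]
    squared_distances = np.sum(diffs ** 2, axis=-1)   # shape (T, 3, N)
    return np.argmin(squared_distances, axis=-1)      # shape (T, 3)

# Toy landmark maps with the same ordering on both faces.
source_landmarks = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
target_landmarks = np.array([[2.0, 1.0], [9.0, 2.0], [1.0, 12.0], [11.0, 11.0]])

# Source triangles given as coordinates, as produced by a triangulation step.
source_triangles = source_landmarks[np.array([[0, 1, 2], [1, 3, 2]])]

indices = triangles_to_indices(source_triangles, source_landmarks)
target_triangles = target_landmarks[indices]   # homologous triangles on the target face
```

Because both faces share the same landmark numbering, indexing the target landmarks with `indices` yields the homologous triangles directly, without triangulating the target face.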
To calculate the triangles of the target face, a series of matrix operations were used. They served to identify the three indices that make up each of the triangles of the source face. Once the homologous triangles had been determined, an affine transformation was used so that the source triangle corresponded to the size and shape of the target triangle. This way, triangles that were geometrically equal to those of the target were obtained, but with an image texture that corresponded to the source face.
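The affine transformation between a pair of homologous triangles is fully determined by their three vertex correspondences. The sketch below solves for it with NumPy. The function names and coordinates are illustrative, and the pixel-warping step is not shown.

```python
import numpy as np

def affine_from_triangles(src, dst):
    """Return the 2x3 affine matrix that maps the three src vertices onto dst.

    src and dst are (3, 2) arrays; src must not be degenerate (collinear).
    """
    src_h = np.hstack([src, np.ones((3, 1))])   # homogeneous coordinates, shape (3, 3)
    solution = np.linalg.solve(src_h, dst)      # solves src_h @ X = dst, X has shape (3, 2)
    return solution.T                           # shape (2, 3)

def apply_affine(matrix, points):
    """Apply a 2x3 affine matrix to an (N, 2) array of points."""
    return points @ matrix[:, :2].T + matrix[:, 2]

src = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])   # source triangle
dst = np.array([[2.0, 1.0], [9.0, 2.0], [1.0, 12.0]])    # homologous target triangle

matrix = affine_from_triangles(src, dst)
mapped = apply_affine(matrix, src)
```

Applying the same matrix to every pixel inside the source triangle gives a patch with the target triangle's shape and the source texture. Libraries such as OpenCV provide `cv2.getAffineTransform` and `cv2.warpAffine` for this purpose, although the source does not state which functions were used.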
Finally, it was necessary to aggregate the newly transformed triangles, so that they matched the mask of the target face, and then transpose them to the image in question. For that purpose, a mask was built on the target image in order to replace it with the mask obtained from the source triangles.
  2.5.2. Smoothing
After the transposition from one face to the other, the result can still be considered incomplete, since no contour smoothing or color-blending methods have been applied. Furthermore, the cinematic animation of the 3D model itself, with the skin surface already colored from the portrait image, requires an implementation that ensures a pleasant and realistic viewing experience for the user. With the face correctly transposed to the target subject, the colors and contrast of the mask of the source subject are automatically adjusted. In this way, the mask is framed in the target subject in a more coherent and harmonious way.
To ensure smooth rotation of the patient’s face mask from frame to frame, a pre-processing of the 3D volume frames was performed where the fiducial points of each frame were calculated and registered. With the 68 points calculated for each frame, a smoothing adjustment was applied to the curve of each index as a function of time, resorting to a quadratic function. Through this operation, the fiducial points were recalculated so that when the process of photographic information transposition took place, a smoother animation could be obtained in its perspective transitions. The final result was exported in GIF format.
Regarding the smoothing of the trajectory curve of each of the 68 fiducial points, the coefficient of determination was used to assess the quality of the adjustment of the points to a quadratic function. For each patient, 68 coefficients were calculated for the adjustment of x-coordinates and 68 for y-coordinates as a function of time (frames). For this analysis, eight cases present in the database were studied.
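The quadratic adjustment and its coefficient of determination can be illustrated for a single coordinate of a single fiducial point. This is a sketch using `numpy.polyfit`, and the synthetic trajectory is not study data. In practice the same fit would be repeated for the x- and y-coordinates of each of the 68 points.

```python
import numpy as np

def quadratic_smooth(values):
    """Fit a quadratic to one coordinate over the frames; return fitted values and R^2.

    Assumes the trajectory is not constant, otherwise R^2 is undefined.
    """
    frames = np.arange(len(values))
    coefficients = np.polyfit(frames, values, deg=2)
    fitted = np.polyval(coefficients, frames)
    ss_res = np.sum((values - fitted) ** 2)          # residual sum of squares
    ss_tot = np.sum((values - values.mean()) ** 2)   # total sum of squares
    return fitted, 1.0 - ss_res / ss_tot

# Synthetic x-coordinate of one landmark over 21 frames, plus alternating jitter.
frames = np.arange(21)
clean = 0.5 * frames ** 2 - 3.0 * frames + 10.0
jittered = clean + (-1.0) ** frames

fitted_clean, r2_clean = quadratic_smooth(clean)
fitted_jittered, r2_jittered = quadratic_smooth(jittered)
```

A coefficient close to 1 indicates that the landmark's path across frames is well described by a quadratic. Lower values flag coordinates for which the smoothing fits poorly.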
  2.6. Survey to Evaluate the Results
The quality of the animations obtained was assessed using an online questionnaire. The results of the methodology were evaluated by a panel of specialist doctors in the area of Orthodontics, belonging to the Institute of Orthodontics at the Faculty of Medicine, University of Coimbra.
The questionnaire consisted of 19 questions, 16 of which referred to four randomly selected cases, while two were related to the method used and one pertained to the respondents’ professional experience. For each case, two different rendering versions were presented, designated as versions A and B, in order to assess which of them was better regarding rendering quality and photorealism. The rendering quality comprised the smoothness, clarity, and definition of the animation. Photorealism was related to the precision of the proportions and dimensions of the subject’s face. Both the rendering quality and the photorealism were evaluated on a 10-point scale, with 1 representing low quality and 10 representing high quality. Among the four selected cases, a decoy case was generated, i.e., an animation was exported whose portrait photograph did not correspond to the CBCT volume, to assess the realism and accuracy of the method of fusion of 2D and 3D modalities. The responses were analyzed using descriptive statistical methods.
  4. Discussion
The present study aimed to investigate the reconstruction of the external surface of the face by using CBCT and extraoral photographs of the patient.
Regarding the rendering and processing of the CBCT volume, it was verified that the 3D representation was faithful to the patient’s characteristic dimensions and proportions. Although no method was developed to measure its accuracy, it was found that the data present in the DICOM volumes can be used together with photos to reconstruct the skin surface of the subject. The thresholding technique that was developed showed good segmentation and processing capabilities for all the subjects in the database. It is a simple and fast algorithm, as opposed to machine learning techniques or the implementation of neural networks, which would require more computational power.
To export the final model, various options were considered (OBJ, PLY and STL); however, the PNG format requires less memory and is the easiest to manipulate, as it does not require specialized software for three-dimensional models. The low contrast of soft tissues in CBCT makes the segmentation of the skin complex; however, this problem was successfully managed. A downside of the methodology studied is that the CBCT volume may contain metallic objects, such as chin and head rests, that act as obstacles and prevent the complete representation of the patient. As a solution, it is necessary to ensure that the CBCT is acquired with the maximum possible exposure of the patient’s head. The quality of the model surface smoothing and of the noise removal was evaluated only from a user perspective.
As for the image processing and fusion of 2D and 3D modalities, the fusion of DICOM images from CBCT with portrait photographs was successfully achieved.
The face recognition algorithm is faster to execute than the face-swapping process. However, it has a limited success rate at identifying all 68 fiducial points. Identifying the 68 fiducial points is not an obstacle for portrait photographs, but it becomes a challenge when processing 3D model frames, as all locations corresponding to the 68-point map must be clearly visible. Moreover, this algorithm was not optimized for recognizing faces in three-dimensional models such as the ones used, which present less contrast.
According to the results obtained, the quadratic adjustment was adequate for the x-coordinates, whereas the adjustment of the y-coordinates (ordinates) was of lower quality.
Overall, the method for mapping faces onto three-dimensional models proved to be simpler and easier to process, with the disadvantage of a limited movement of the patient’s head model.
Lastly, the results of the questionnaires that were filled in by the orthodontists revealed that case 3, referred to as the decoy, performed its function well. Not only was the mean result lower when case 3 was considered, but the mean deviation was also significantly higher, which indicates that the confidence margin in the results was low. This case was the result of the combination of a CBCT volume and a portrait image of two different patients, in order to check whether a poor correspondence between the 3D and 2D modalities was perceptible. Therefore, it can be stated that the developed algorithm has a good ability to realistically replicate and map the patients’ facial features. The results also showed a superior performance of version A compared to version B. This is an unexpected result, given that a contrast smoothing function was applied to version B. A possible justification could be that version A may have greater definition and clarity of the photographic properties of the subject’s face. Furthermore, photorealism was ranked higher than rendering quality, with both showing results around 6/10. As for the coefficient of variation, version B showed less dispersion, with mean deviations of less than 25%, and version A showed satisfactory values of 48% and 30%. Thus, in general, the experts’ ratings seem to be coherent: version A was preferred to version B, and photorealism and rendering quality presented similar ratings.