Article

Siamese PointNet: 3D Head Pose Estimation with Local Feature Descriptor

School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(5), 1194; https://doi.org/10.3390/electronics12051194
Submission received: 9 February 2023 / Revised: 23 February 2023 / Accepted: 27 February 2023 / Published: 1 March 2023

Abstract

Head pose estimation is an important part of the field of face analysis technology. It can be applied to driver attention monitoring, passenger monitoring, effective information screening, etc. However, illumination changes and partial occlusion interfere with the task, and due to the non-stationary characteristic of the head pose change process, ordinary regression networks are unable to achieve very accurate results on large-scale synthetic training data. To address the above problems, a Siamese network based on 3D point clouds is proposed, which adopts a shared-weight network with similar pose samples to constrain the regression process of the pose angles; meanwhile, a local feature descriptor is introduced to describe the local geometric features of the objects. In order to verify the performance of our method, we conducted experiments on two public datasets: the Biwi Kinect Head Pose dataset and Pandora. The results show that compared with the latest methods, our standard deviation was reduced by 0.4 and the mean error was reduced by 0.1; meanwhile, our network also maintains good real-time performance.

1. Introduction

Head pose estimation is an important part of the field of computer vision and also an important indicator for studying human behavior and attention. It can provide key information for many facial analysis tasks, such as face recognition, facial expression recognition, and driving concentration prediction [1]. The essence of the task is to predict the three pose angles (roll, pitch, and yaw) of the object relative to the sensor. An effective algorithm should include the following main factors: high accuracy, real-time performance, and the ability to cope with partial occlusion and large pose variations [2]. With respect to the above factors, many RGB-based head pose estimation algorithms have been proposed and have achieved very good performance [2]. However, the imaging quality of ordinary RGB sensors depends on light conditions, making them difficult to apply in some scenarios where light is weak or variable, such as night driving concentration detection, expression recognition in weak light environments, etc. [3]. With the development of depth sensors, it has become more convenient to obtain high-quality depth images (also known as 2.5D images) [4]. Compared with ordinary RGB sensors, depth cameras have the following two main advantages. One advantage is that their infrared-based imaging principle, in which each pixel represents the distance from the target to the sensor, makes the imaging quality depend mainly on distance and remain stable under varying light conditions; thus, they can be safely applied in human daily life [3]. The other advantage is that they can easily achieve background separation based on distance information, which reduces the interference of the background and enables the task to focus on the object itself [1]. Depth maps can be easily converted into 3D point clouds by a simple coordinate transformation, which enables point clouds to inherit the above advantages of depth maps. Meanwhile, point clouds better describe the spatial geometric features of objects in 3D space, the contours appear more layered with clearer outlines, and important information around the outlines is well retained [5].
Recently, many 3D methods based on different data types have been proposed for face analysis, such as those using meshes, voxel grids, octrees, and surface normal maps. Compared with these four data types, the mathematical expression of a point cloud is more concise and can directly represent the spatial geometry information of an object. However, the disorder of point clouds makes them difficult to apply to deep learning. Pioneers Qi et al. [6] relied on the idea of symmetric functions to solve the disorder of 3D point clouds and constructed PointNet. Many point cloud deep learning networks were subsequently proposed: Qi et al. [7] optimized PointNet and proposed PointNet++, and Deng et al. [8] introduced a local region representation to extract local features. Many point cloud methods have also been proposed for 3D computer vision. However, in the field of head pose estimation, due to the lack of detailed textures in point clouds, current pose estimation methods have not paid much attention to the local features of original point clouds, which leads to larger errors under large pose variations. Meanwhile, due to the non-stationary characteristic of the pose change process, previous regression networks were unable to achieve very good results on large-scale synthetic training data [9]. In order to deal with the above problems, we introduce a local feature descriptor coupled with a Siamese regression network for 3D head pose estimation. In our method, we first employ a local feature descriptor to describe the spatial geometric features of the objects; then, a group of PointNets is adopted to extract the local features, and three fully connected layers are used to map the head features to pose angles. Second, we utilize a shared-weight regression network with similar pose samples to guide the regression process of the pose angles. Finally, a novel loss function is introduced to constrain the difference between two similar features. In order to verify the effectiveness of the proposed method, we conduct experiments on two public datasets: the Biwi Kinect Head Pose dataset and Pandora.
The main contributions of this paper are summarized as follows:
  • We introduce a local feature descriptor to describe the detailed features of the point clouds to reduce the impact of their lack of detailed texture.
  • We present a new Siamese network to constrain the regression process of 3D head pose angles, which significantly reduces the errors of the original regression network. To the best of our knowledge, this is the first work to estimate 3D head poses by using a Siamese network.
  • The experimental results on public datasets show that our accuracy outperforms the latest approaches and also exhibits a good real-time performance.

2. Related Works

In recent years, the most widely used head pose estimation methods have mainly been based on RGB images. Drouard et al. [10] extracted HOG-based descriptors from face bounding boxes and mapped them to the corresponding head poses. Patacchiola et al. [11] proposed a convolutional neural network (CNN) supplemented with adaptive gradient methods to make the method robust for real-world applications. Hsu et al. [9] adopted a classification network to supervise the regression process of pose angles, which significantly improved the accuracy of the head pose estimation. Ruiz et al. [12] jointly combined pose classification and regression training with a multi-loss convolutional neural network on a large synthetically expanded dataset, which reduced the dependence on landmarks and enhanced the robustness of the network. Recently, Huang et al. [13] introduced a head pose estimation method using two-stage ensembles with average top-k regression, which combined the two subtasks by considering task-dependent weights instead of setting coefficients by grid search. Based on the driver's head pose and multi-head attention, Mercat et al. [14] proposed a vehicle motion forecasting method. In order to cope with complex situations, Liu et al. [15] proposed a robust three-branch model with a triplet module and a matrix Fisher distribution module. Considering the discontinuity of Euler angles or quaternions and the observation that the MAE may not reflect the actual behavior, Cao et al. [16] proposed an annotation method that uses three vectors to describe the head pose and the mean absolute error of the vectors (MAEV) to assess the performance. Relying on head poses, Jha et al. [17] proposed a formulation based on probabilistic models to create salient regions describing the driver's visual attention. In order to bridge the gap between better predictions and incorrectly labeled pose images, Liu et al. [18] introduced probability values to encode labels, which took advantage of the adjacent poses' information and achieved very good performance.
Compared to RGB images, depth maps cope well with dramatic light changes but lack texture detail [5], and only a few studies rely solely on depth maps [3]. Ballotta et al. [4] constructed a fully convolutional network to predict the location of the head's center. Wang et al. [19] combined the perception of deep learning and the decision-making power of machine learning to propose a convolutional neural network for multi-target head center localization. Borghi et al. [1] converted depth maps into gray-level images and motion images via GAN networks and combined them to predict the head pose; this method relies on three types of training samples and greatly improved the head pose prediction accuracy. Wang et al. [20] relied only on depth maps and constructed a one-shot network for face verification, which achieved a high accuracy with a small training sample. Recently, Wang et al. [21] employed an L2 norm to constrain head features in order to reduce the interference of partial occlusions for face verification.
As mentioned above, many methods based on point clouds have been proposed and have made breakthrough progress. Xiao et al. [2] utilized PointNet++ to extract the global features of the head and constructed a regression network for pose estimation. Xu et al. [22] presented a statistical, articulated 3D human shape modeling pipeline, which captured various poses together with additional closeups of the individual's head and facial expressions. Then, Xiao et al. [23] adopted a classification network associated with soft labels to supervise the regression process of the pose angles. Hu et al. [24] leveraged the 3D spatial structure of the face and combined it with bidirectional long short-term memory (BLSTM) layers to estimate head poses in naturalistic driving conditions. Considering that point clouds lack texture, Zou et al. [25] combined gray images and proposed a sparse loss function for 3D face recognition. Recently, Ma et al. [26] combined PointNet and deep regression forests to construct a new deep learning method in order to improve the efficiency of head pose estimation. Cao et al. [27] proposed the RoPS local descriptor to map local features to three different planes and leveraged FaceNet to achieve 3D face recognition with high accuracy. Based on a multi-layer perceptron (MLP), Xu et al. [28] constructed a classification network to predict the probability of each angle, and they also combined it with a graph convolutional neural network to reduce computation and memory costs.
In our method, we employ a Siamese network to supervise the regression process of the pose angles. The Siamese network was first proposed by Bromley et al. [29], who applied it to signature verification tasks. Based on the Siamese network, many methods have been proposed for computer vision. Melekhov et al. [30] used a Siamese network to extract a pair of features and calculated their similarity to determine whether the images matched. Varga et al. [31] introduced a deep multi-instance learning approach for person re-identification. Considering the local patterns of the target and their structural relationships, Zhang et al. [32] proposed a local structure learning method, which provides more accurate target tracking. Recently, Wang et al. [33] conducted a formal study on the importance of asymmetry by explicitly distinguishing the two encoders within the network and exploiting the asymmetry for Siamese representation learning.

3. Methods

In this section, we first introduce PointNet for point cloud feature extraction, and we propose a local feature descriptor to describe the local regions. Second, we construct a head pose regression network for the pose estimation. Finally, a Siamese network with similar samples is introduced to guide the training process of the pose regression network.

3.1. Introduction of Point Clouds and Feature Extraction

A point cloud is a series of points in 3D space, expressed as an $n \times 3$ matrix, where $n$ is the number of points and 3 represents the $(x, y, z)$ coordinates of a point in the world coordinate system; however, the sequence of the points of the same object is not necessarily consistent [5]. Moreover, due to the disorder of point clouds, they cannot use an index sequence similar to regular 2D images or 3D voxels to achieve weight sharing for convolution operations [34]. Solving the disorder of the point clouds and performing an effective feature extraction is the key factor for facial analysis based on point clouds [2]. According to Theorem 1, Qi et al. [6] utilized the idea of a symmetric function to construct a deep learning network to deal with the disorder of the point clouds.
Theorem 1.
Suppose $f : \mathcal{X} \to \mathbb{R}$ is a continuous set function w.r.t. the Hausdorff distance $d_H(\cdot, \cdot)$. Then, $\forall \varepsilon > 0$, there exist a continuous function $h$ and a symmetric function $g(x_1, x_2, x_3, \dots, x_n) = \gamma \circ \mathrm{MAX}$ such that for any $S \in \mathcal{X}$,
$\left| f(S) - \gamma\left( \mathrm{MAX}_{x_i \in S} \{ h(x_i) \} \right) \right| < \varepsilon$   (1)
where $x_1, x_2, x_3, \dots, x_n$ is the full list of elements in $S$ ordered arbitrarily, $\gamma$ is a continuous function, and MAX is a vector max operator that takes $n$ vectors as input and returns a new vector of the element-wise maximum.
Theorem 1 shows that if there are enough feature dimensions in the MAX operator, the function $f$ can be arbitrarily approximated by $\gamma\left( \mathrm{MAX}_{x_i \in S} \{ h(x_i) \} \right)$.
Inspired by Theorem 1, a multilayer perceptron (MLP) is adopted to construct the right side of Equation (2) in order to approximate the left side:
$f(x_1, x_2, x_3, \dots, x_n) \approx \gamma \circ g\left( h(x_1), h(x_2), h(x_3), \dots, h(x_n) \right)$   (2)
where $f$ and $h$ are different general functions that map the independent variables $(x_1, x_2, x_3, \dots, x_n)$ and $x_i$ to the feature spaces $\mathbb{R}^m$ and $\mathbb{R}^l$, respectively; $g$ is a symmetric function (it approximates the MAX operator in Theorem 1, so its result is independent of the input order of its arguments); and $\gamma$ is another general function $\mathbb{R}^l \to \mathbb{R}^m$ that maps the result of the symmetric function $g$ to the feature space $\mathbb{R}^m$ [5]. For a disordered point cloud, Qi et al. [6] employed a convolutional neural network as the MLP and a max pooling layer as the symmetric function to extract the global feature of the object for classification and segmentation tasks. However, head pose estimation is a regression task, and it is difficult to achieve accurate results using only global features. In this step, we adopt a shallow network structure, which removes the transform nets of PointNet, and we adjust the dimensions of each layer to make it suitable for the local feature extraction in the next step. The structure of our proposed network is shown in Figure 1.
As shown in Figure 1, for an input point cloud object with $n$ points, we use three convolutional layers with 64, 128, and 256 filters to map every point to a high-dimensional feature space: $\mathbb{R}^3 \to \mathbb{R}^{64} \to \mathbb{R}^{128} \to \mathbb{R}^{256}$. Moreover, inspired by [6,35,36,37], a max pooling layer is utilized as the symmetric function (the MAX operator in Theorem 1) to solve the disorder of the point set and to extract the feature in $\mathbb{R}^{256}$.
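For concreteness, the following minimal Python sketch (illustrative only, not the authors' released code) shows how the shared MLP and max pooling layer of Figure 1 could be written; it assumes a PyTorch implementation and raw 3-channel coordinates as input (when the descriptor features of Section 3.2 are appended, the input channel count grows accordingly). The final assertion demonstrates the permutation invariance guaranteed by Theorem 1.

```python
import torch
import torch.nn as nn

class PointNetFeat(nn.Module):
    """Shared point-wise MLP (64, 128, 256) followed by max pooling as the
    symmetric function of Theorem 1 (the transform nets of PointNet are removed)."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(in_channels, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
        )

    def forward(self, x):              # x: (B, in_channels, n) points
        f = self.mlp(x)                # (B, 256, n) per-point features
        return f.max(dim=2).values     # (B, 256) order-invariant feature

# Permutation invariance: shuffling the points leaves the extracted feature unchanged.
pts = torch.randn(1, 3, 4096)
net = PointNetFeat().eval()
perm = torch.randperm(4096)
assert torch.allclose(net(pts), net(pts[:, :, perm]), atol=1e-5)
```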
In order to ensure that the network has the same feature input dimension and can evenly sample the points, the farthest point sampling method is adopted to sample a fixed number of points for each object (each object is sampled to 4096 points) before PointNet.
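A minimal NumPy sketch of farthest point sampling as it is commonly implemented (the function name and the random starting point are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def farthest_point_sampling(points, n_samples=4096, seed=0):
    """Iteratively pick the point farthest from the already selected set,
    giving an approximately uniform subsample of the head point cloud.
    points: (n, 3) array; returns (n_samples, 3)."""
    rng = np.random.default_rng(seed)
    selected = np.empty(n_samples, dtype=np.int64)
    selected[0] = rng.integers(points.shape[0])
    dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for i in range(1, n_samples):
        selected[i] = np.argmax(dist)           # farthest remaining point
        dist = np.minimum(dist, np.linalg.norm(points - points[selected[i]], axis=1))
    return points[selected]
```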

3.2. Local Feature Descriptor

Compared with RGB images, point clouds lack detailed textures, which results in difficulty in effectively characterizing objects by only using global features [2], and the position information of the points cannot directly reflect the geometric relationship between the points [8]. In order to enhance the description of the geometric details of the local region, in this step we adopt a local feature descriptor to describe the geometric characteristics of the local region.
For a pair of points ( p i , p j ) in a local region, in order to describe the geometric relationship between two points, a four-dimensional descriptor is introduced:
$\psi_{ij} = \left( \| d \|_2,\ \angle(n_i, d),\ \angle(n_j, d),\ \angle(n_i, n_j) \right)$   (3)
where $d$ is the vector representing the difference between the two points, and $\| \cdot \|_2$ is the Euclidean distance. $n_i$ and $n_j$ are the normal vectors of $p_i$ and $p_j$ in the local region, respectively. As shown in Equation (4), $\angle$ denotes the angle between two vectors.
$\angle(n_i, n_j) = \mathrm{atan2}\left( \| n_i \times n_j \|,\ n_i \cdot n_j \right)$   (4)
The four-dimensional descriptor describes the spatial geometric characteristics of a point pair. For all points $\{ p_1, p_2, p_3, \dots, p_j \}$ in a local region with $p_i$ as the center and $k$ as the radius ($k$ is 0.4 in our method), we obtain $j$ point pairs with center point $p_i$. The encoding of this local region is then expressed as Equation (5):
$F_i = \left[ p_1, n_1, p_2, n_2, \dots, p_j, n_j, \psi_{i1}, \psi_{i2}, \dots, \psi_{ij} \right]$   (5)
where $n_j$ is the normal vector of point $p_j$, and $\psi_{ij}$ is the four-dimensional feature descriptor between point $p_j$ and center point $p_i$. As shown in Figure 2, $F_i$ describes the spatial geometric characteristics of the local region via the local feature descriptors between all points and the center point $p_i$ in this local region.
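The following NumPy sketch illustrates Equations (3)-(5); it assumes the surface normals are precomputed, implements the angle operator in the usual atan2 form, and uses illustrative function names rather than any from the paper:

```python
import numpy as np

def angle(v1, v2):
    """Angle between two vectors, Equation (4): atan2(|v1 x v2|, v1 . v2)."""
    return np.arctan2(np.linalg.norm(np.cross(v1, v2)), np.dot(v1, v2))

def pair_descriptor(p_i, n_i, p_j, n_j):
    """Four-dimensional descriptor psi_ij of Equation (3) for one point pair."""
    d = p_j - p_i
    return np.array([np.linalg.norm(d), angle(n_i, d), angle(n_j, d), angle(n_i, n_j)])

def encode_local_region(points, normals, center_idx, k=0.4):
    """F_i of Equation (5): positions, normals, and pair descriptors of all
    points within radius k of the center point p_i."""
    p_i, n_i = points[center_idx], normals[center_idx]
    mask = np.linalg.norm(points - p_i, axis=1) < k
    mask[center_idx] = False                      # drop the degenerate pair (p_i, p_i)
    feats = [np.concatenate([points[j], normals[j],
                             pair_descriptor(p_i, n_i, points[j], normals[j])])
             for j in np.where(mask)[0]]
    return np.stack(feats)                        # (num_neighbors, 3 + 3 + 4)
```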

3.3. Pose Prediction Network

In this section, we utilize the PointNet with the local feature descriptor to construct a prediction network for head pose estimations; the structure of the head pose prediction network is shown in Figure 3.
As shown in Figure 3, for an input object with 4096 points $\{ p_1, p_2, p_3, \dots, p_{4096} \}$, we select every point as the center of a sphere with radius $k$ ($k$ is 0.4 in our method), and the points in the same sphere are regarded as belonging to the same local region $\{ L_1, L_2, L_3, \dots, L_{4096} \}$. For each $L_i$, we adopt the local feature descriptor to describe the geometric characteristics of the local region: $\{ \psi_1, \psi_2, \psi_3, \dots, \psi_{4096} \}$ ($\psi_i$ represents the local geometric characteristics of local region $L_i$). Then, PointNet, as shown in Figure 1, is utilized to extract the features of each $\psi_i$. After the above steps, we obtain a set of local features in a high-dimensional feature space $\{ f_1, f_2, f_3, \dots, f_{4096} \}$. Subsequently, a max pooling layer is used to extract the overall feature $F_w$ from all the local features $\{ f_1, f_2, f_3, \dots, f_{4096} \}$. Finally, three fully connected layers with 256, 64, and 3 filters are adopted to map the head feature $F_w$ to the three pose angles.
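A condensed sketch of how these stages could be assembled, assuming PyTorch and a fixed number of neighbors per region for simplicity (the paper builds one region around each of the 4096 points with a variable neighborhood size); layer widths follow Figures 1 and 3, and the 10 input channels correspond to position, normal, and the four-dimensional descriptor:

```python
import torch
import torch.nn as nn

class PosePredictionNet(nn.Module):
    """Regresses (roll, pitch, yaw) from per-region descriptor features.
    Input: (B, R, C, m) -- R local regions, each with m points of C channels."""
    def __init__(self, in_channels=10):
        super().__init__()
        self.region_net = nn.Sequential(           # shared PointNet-style MLP
            nn.Conv1d(in_channels, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.ReLU(),
        )
        self.head = nn.Sequential(                  # FC layers with 256, 64, 3 filters
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 3),
        )

    def forward(self, regions):
        B, R, C, m = regions.shape
        f = self.region_net(regions.reshape(B * R, C, m))    # per-region point features
        f = f.max(dim=2).values.reshape(B, R, 256)            # local features f_1..f_R
        F_w = f.max(dim=1).values                              # max pool over regions
        return self.head(F_w)                                  # (B, 3) pose angles
```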
The loss function of our head pose prediction network is defined as follows:
$L_{predict} = \sum_{j=1}^{n} \left\| G_j - P_j \right\|_2^2$   (6)
where $G_j$ represents the ground truth of the three pose angles (expressed in Euler angles: roll, pitch, and yaw), and $P_j$ is the prediction value of our head pose prediction network.

3.4. Siamese Network for Pose Constraint

As described above, a regression network is constructed to predict head poses, but due to the non-stationary characteristic of the head pose change process, it is difficult for a single regression network to cope with large-scale synthetic training data [23], which will result in a large prediction error. In order to deal with the above problem, a Siamese network with similar samples was proposed to constrain the prediction values and guide the regression process of the pose prediction network.
The structure of the proposed Siamese network is shown in Figure 4. The network consists of two identical branches, which accept similar pose samples as the inputs and extract features. The ends of the two branches are connected by an energy function to compute the difference between the two features:
$L_{energy} = \sum_{j=1}^{n} \left\| D_{net}(x_j) - D_{gt}(x_j) \right\|_2^2$   (7)
$D_{net}(x_j) = P_{1j} - P_{2j}$   (8)
$D_{gt}(x_j) = G_{1j} - G_{2j}$   (9)
where $D_{net}$ is the difference between the two predicted pose angles extracted by the two branches, and $D_{gt}$ represents the difference between their ground truths [38].
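Equations (6)-(9) can be written directly as loss terms; the sketch below assumes PyTorch tensors of shape (batch, 3), and how the prediction and energy terms are weighted against each other is an assumption, since the text does not specify a combination:

```python
import torch

def prediction_loss(pred, gt):
    """L_predict, Equation (6): squared L2 error over the three pose angles."""
    return ((gt - pred) ** 2).sum(dim=1).mean()

def energy_loss(pred_1, pred_2, gt_1, gt_2):
    """L_energy, Equations (7)-(9): the difference between the two branches'
    predictions should match the difference between their ground truths."""
    d_net = pred_1 - pred_2       # D_net = P_1j - P_2j
    d_gt = gt_1 - gt_2            # D_gt  = G_1j - G_2j
    return ((d_net - d_gt) ** 2).sum(dim=1).mean()

# One possible overall objective for a pair of similar samples (equal weights assumed):
# loss = prediction_loss(p1, g1) + prediction_loss(p2, g2) + energy_loss(p1, p2, g1, g2)
```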
Considering that the training dataset has a total of $N$ samples, a large number of possible pairs (up to $N(N-1)/2$) can be formed, and for a specific pair of samples $(S_i, S_j)$, only those whose total difference over all the pose angles (ground truth values) is within $\gamma$ degrees are selected:
$\left| G_{s_i} - G_{s_j} \right| < \gamma$   (10)
where $\gamma$ determines the similarity of the pair of samples. In the training process, the energy function $L_{energy}$ is also regarded as the loss function of the Siamese network.
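A NumPy sketch of the pair selection implied by Equation (10), interpreting the "total difference" as the sum of the absolute per-angle differences (an interpretation, since the text does not spell this out); the quadratic memory cost is acceptable only for illustration:

```python
import numpy as np

def mine_similar_pairs(gt_angles, gamma=15.0):
    """Return index pairs (i, j), i < j, whose total pose difference is below gamma.
    gt_angles: (N, 3) ground-truth (roll, pitch, yaw) for the training samples."""
    diff = np.abs(gt_angles[:, None, :] - gt_angles[None, :, :]).sum(axis=2)  # (N, N)
    i, j = np.nonzero(np.triu(diff < gamma, k=1))   # keep each unordered pair once
    return list(zip(i, j))
```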
Compared with a single-branch network, the proposed Siamese network has two main advantages. First, the parameters between the identical networks are shared, which guarantees that a pair of very similar samples is not mapped to very different locations in the feature space by the respective networks. Second, as the loss function $L_{energy}$ converges during training, similar pose samples within $\gamma$ are extracted by their own networks, which enables the two regression networks to supervise each other and prevents either side from being mapped to a more distant area in the feature space. In the testing stage, we only employ one pose prediction network to estimate the head pose (the parameters of the two networks are tied).
The hyperparameters of our Siamese network are as follows: the learning rate is 0.001, the decay rate is 0.99, the batch size is 64, and the decay step size is 500.
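These hyperparameters map onto a standard training setup roughly as follows (the optimizer choice is an assumption, as the paper does not name one; the scheduler is stepped once per training iteration so that the decay step size of 500 refers to iterations):

```python
import torch

model = PosePredictionNet()          # from the sketch in Section 3.3
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)               # optimizer assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.99)
batch_size = 64                      # pairs of similar pose samples per batch
```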

4. Experiments

In this section, we first introduce two public datasets for experiments: the Biwi Kinect Head Pose dataset and Pandora. Second, in order to verify the effect of the local feature descriptor and investigate similarity γ in Equation (10), we conduct ablation experiments on the Biwi Kinect Head Pose dataset. Third, we investigate the influence of the input number of points. Finally, we use our best results for comparison experiments with the latest methods and analyze the results.

4.1. Datasets

With respect to the Biwi Kinect Head Pose dataset, Fanelli et al. [39] utilized Kinect to collect this dataset. This dataset has a total of more than 15,000 head pose images, each object contains depth maps and the corresponding RGB images, and the resolution is 640 × 480 . Biwi records 24 sequences of 20 different objects (6 females and 14 males, some of them are recorded twice). It is a challenging dataset with various head poses and partial occlusion. The test set includes sequences 11 and 12, which contain around 1304 images, and the training set contains the remaining 22 sequences, which contain around 14,000 images.
With respect to the Pandora dataset, Borghi et al. [1] collected this dataset specifically for head and shoulder pose estimation. Pandora has a total of more than 250,000 images, and each object contains depth maps (with a resolution of 512 × 424) and corresponding RGB images (with a resolution of 1920 × 1080). The dataset records 110 sequences of 10 male and 12 female objects. The recordings cover the upper body and contain various postures, hairstyles, glasses, scarves, etc.
The above two datasets only provide RGB and depth images; the depth images must be transformed into point clouds before being fed into the Siamese network. First, we directly use the ground truth of the head center $H_c$ with its depth value $D_c$ to obtain the head area (head detection is not the focus of our method), and we remove the background by setting depth values greater than $D_c + 300$ to 0 (300 mm is the general extent of a real head). Second, we transform the depth map from the image coordinate system to the world coordinate system:
$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = D_c \begin{bmatrix} \frac{1}{f_x} & 0 & 0 \\ 0 & \frac{1}{f_y} & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}$   (11)
where $(x_i, y_i)$ denotes a pixel in the image coordinate system, $f_x$ and $f_y$ represent the horizontal and vertical focal lengths from the intrinsic parameters of the depth sensor, and $(x, y, z)$ is the position of the point converted from the pixel. Figure 5 shows examples of RGB images, depth maps, and point clouds from the Biwi Kinect Head Pose and Pandora datasets.
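A NumPy sketch of this preprocessing step: background removal at $D_c + 300$ mm and back-projection of each remaining pixel. As in Equation (11), principal-point offsets are omitted; each pixel is scaled by its own depth value, which is one interpretation of how the equation is applied per pixel:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, d_c, margin=300.0):
    """Back-project a depth map (values in mm) to a head point cloud.
    Pixels with depth greater than d_c + margin are treated as background."""
    h, w = depth.shape
    xi, yi = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    valid = (depth > 0) & (depth <= d_c + margin)      # background removal
    d = depth[valid].astype(np.float64)                # per-pixel depth value
    x = d * xi[valid] / fx
    y = d * yi[valid] / fy
    return np.stack([x, y, d], axis=1)                 # (num_points, 3)
```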
Both Biwi Kinect Head Pose and Pandora datasets provide ground truth pose angles (roll, pitch, and yaw). In our experiments, according to previous methods [1,2,23,26,38], we use the mean of the absolute values and the standard deviation to quantitatively evaluate the accuracy:
$S = \delta \pm \beta$   (12)
δ denotes the mean of the absolute values (MAE) of the difference between all ground truth and predicted values, and β is the standard deviation of the absolute values (SD) of the difference between all ground truth and predicted values.
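The evaluation of Equation (12) amounts to the following small helper (a sketch; the arrays are assumed to have shape (N, 3), one row per test frame):

```python
import numpy as np

def mae_sd(gt, pred):
    """Per-angle MAE and SD of the absolute errors, reported as 'MAE ± SD'."""
    err = np.abs(gt - pred)                  # (N, 3) absolute errors (roll, pitch, yaw)
    return err.mean(axis=0), err.std(axis=0)
```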

4.2. Ablation Experiments

As described in Section 3.2, we introduced a local feature descriptor to describe the local region. In order to verify the effect of our method, according to the method in [7], we replace the local feature descriptor and only use the position information of the points to describe the local region.
In this section, to intuitively demonstrate the effect of the descriptor, we only use a single branch, as shown in Figure 3, to conduct the ablation experiment. The results are reported in Table 1.
As shown in Table 1, the accuracy of the head pose prediction network greatly improved with the local feature descriptor, where the MAE is reduced by 0.3 and the SD is reduced by 0.2. This is because the descriptor provides the network with detailed local geometric features, which are more conducive to the extraction of the pose characteristics. On the other hand, our method would lead to extra computational costs, but it still maintains a good real-time performance. The results in Table 1 prove the effectiveness of the proposed local feature descriptor.
According to Equation (10) in Section 3.4, γ represents the similarity of the pair of samples. For a deep learning network, training samples are a key factor for the performance. In this step, we conduct comparison experiments on Biwi to decide the best γ for the Siamese network; the results are reported in Table 2.
As shown in Table 2, when $\gamma = 0$, the input pair of samples has the same pose angles (the same sample), the loss function $L_{energy}$ is 0, and the constraint of the Siamese network is not utilized. As $\gamma$ increases, the two branches of the Siamese network start to constrain each other. When $\gamma = 15$, our network achieves the best results. As $\gamma$ continues to increase, the accuracy begins to decline. This is because more similar pose samples are more conducive to constraining the pose angles within a smaller range. However, when $\gamma$ is too small, the Siamese network also cannot achieve the best results, because the pose features of the samples are too close, which makes it difficult for the Siamese network to distinguish the difference.
Figure 6 shows the prediction accuracy with different $\theta$ metrics. For each pose angle, if the absolute value of the difference between the prediction value and the ground truth is less than $\theta$, the pose angle is regarded as accurately predicted. According to Figure 6, when $\theta$ is too small, the accuracy is obviously low. When the total difference $\gamma$ is 5 or 10, the difference in the head pose is quite small, especially for a single angle.
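The θ-based accuracy of Figure 6 can be computed as below (a sketch under the same array-shape assumption as the MAE/SD helper above):

```python
import numpy as np

def accuracy_at_threshold(gt, pred, theta):
    """Fraction of predictions whose absolute error is below theta, per angle."""
    return (np.abs(gt - pred) < theta).mean(axis=0)    # (roll, pitch, yaw) accuracies
```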
According to Table 2, we set γ = 15 as our best result for comparison experiments. Figure 7 shows the curves of the loss function and the accuracy of the Siamese network during training when γ = 15 .

4.3. Input Number of Points

As mentioned above, we sampled 4096 points for each object, but the number of input points affects the performance of the network. As shown in Figure 8, this is because the number of points affects the detailed information of the object and also determines the number of local regions. In this section, we investigate the input number of points by halving the number of points at each step (we again adopt the farthest point sampling method to sample the points). The results are reported in Table 3.
As reported in Table 3, our network has a higher accuracy with an increased input number of points, which indicates that more points are beneficial for describing more detailed features of the object and can significantly improve the accuracy of the network, although the time consumption increases. However, 4096 points at 288 fps still maintains good real-time performance for most applications.

4.4. Comparison Experiments

In this section, we conduct comparison experiments on two public datasets, and we analyze the results. Table 4 reports the comparison of the results with the latest methods on the Biwi Kinect Head Pose dataset.
Table 4 lists a comparison of the experimental results on the Biwi Kinect Head Pose dataset. The methods in [13,16,18] only report their MAE, and the other methods report MAE ± SD. As shown in Table 4, the accuracy of the depth and point cloud methods is obviously higher than that of the RGB methods. This is because geometric information is more conducive to the extraction of pose features, especially under partial occlusion and large pose interference. Compared with depth maps, point clouds have more abundant geometric information and clearer contours, which are more beneficial to pose feature extraction. Although Borghi et al. [1] achieved a very high accuracy while relying only on depth maps, they used two GAN networks to generate gray and motion images, which leveraged three types of images to jointly predict the head pose, and the entire network structure is too complex.
As per the results reported in Table 4, compared with the methods in [1], our MAE was reduced by 0.1, and compared with the methods in [26], although their MAE is lower, our SD was reduced by 0.4. Overall, the accuracy of our method is higher than that of the other methods.
In order to intuitively show the test results on the Biwi Kinect Head Pose dataset, Figure 9a shows the ground truth and the prediction values of all the test samples, and Figure 9b shows the error distributions for each pose angle. As shown in Figure 9, the prediction results are very close to the ground truth, and the error distribution is convergent.
Table 5 lists a comparison of the experimental results on the Pandora dataset, which contains more abundant samples with a series of large body gestures and partial occlusion. As reported in Table 5, our accuracy outperforms the latest methods. Compared with Xiao et al. [23], our accuracy is very close to theirs, and only the MAE was reduced by 0.1; however, for each pose angle, our MAE and SD were better than or equal to theirs, except for the SD of the roll angle. Figure 10 shows examples of our method on Pandora.
As shown in Figure 10, our method can cope well with pose predictions with respect to various pose changes and provide an accurate pose angle estimation.
For the head pose estimation task, besides accuracy, the time cost is also an important indicator for measuring performance, which determines whether the method can be applied to real application scenarios. Table 6 lists a comparison of different methods in terms of time costs. Because different data types are processed in different ways, for a fair comparison, we only compare with point cloud methods.
As shown in Table 6, compared with recent head feature extraction methods, our method is faster. This is because the local feature descriptor described the spatial geometric features of the local regions in detail before the deep learning network, which allows us to adopt a shallow network to extract the features and enables the network to maintain a good real-time performance.
Combining Table 4 and Table 5, it is noticeable that our accuracy outperforms the latest methods, and Table 6 proves that our network also has a good real-time performance.
We conducted our experiments on Ubuntu 16.04. The hardware used is listed as follows: the GPU is an NVIDIA GTX1080ti, the CPU is an Intel Core i7 (3.40 GHz), the display is a SAMSUNG S27R350FHC (75 Hz, resolution: 1920 × 1080), and the depth cameras are the Kinect v2 (resolution: 640 × 480) for the Biwi Kinect Head Pose dataset and the Kinect One (resolution: 512 × 424) for the Pandora dataset.

5. Conclusions

In this study, in order to cope with the non-stationary characteristic of the head pose change process, a new Siamese network with a local feature descriptor was constructed for 3D head pose estimation. In the feature extraction stage, a four-dimensional descriptor is introduced to describe the geometric relationship between a pair of points, which can describe the geometric characteristics of the local regions in detail. In the head pose estimation stage, similar pose samples are used to constrain the regression process of the pose angles. Ablation experiments proved the effectiveness of the local feature descriptor, and the results of the experiments on public datasets show that our accuracy outperformed the latest methods (the SD was reduced by 0.4 and the MAE was reduced by 0.1). Simultaneously, the proposed method also maintained real-time performance and can be applied to real application scenarios. However, in the case of partial occlusions, the accuracy is still not sufficient. In future studies, we will further optimize the network and explore new methods for other 3D face analysis technologies.

Author Contributions

Conceptualization, Q.W.; data curation, Q.W.; formal analysis, Q.W. and W.Q.; investigation, Q.W.; methodology, Q.W.; project administration, H.L.; resources, Q.W.; software, Q.W.; supervision, H.L.; visualization, Q.W.; writing—original draft, Q.W.; writing—review and editing W.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by the National Natural Science Foundation of China (61802052).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Borghi, G.; Fabbri, M.; Vezzani, R.; Calderara, S.; Cucchiara, R. Face-from-Depth for Head Pose Estimation on Depth Images. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 596–609.
  2. Xiao, S.; Sang, N.; Wang, X. 3D point cloud head pose estimation based on deep learning. J. Comput. Appl. 2020, 40, 996.
  3. Ballotta, D.; Borghi, G.; Vezzani, R.; Cucchiara, R. Head detection with depth images in the wild. arXiv 2017, arXiv:1707.06786.
  4. Ballotta, D.; Borghi, G.; Vezzani, R.; Cucchiara, R. Fully convolutional network for head detection with depth images. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 752–757.
  5. Wang, Q.; Qian, W.Z.; Lei, H.; Chen, L. Siamese Neural Pointnet: 3D Face Verification under Pose Interference and Partial Occlusion. Electronics 2023, 12, 620.
  6. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
  7. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5099–5108.
  8. Deng, H.; Birdal, T.; Ilic, S. Ppfnet: Global Context Aware Local Features for Robust 3D Point Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 195–205.
  9. Hsu, H.W.; Wu, T.Y.; Wan, S.; Wong, W.H.; Lee, C.-Y. Quatnet: Quaternion-based head pose estimation with multiregression loss. IEEE Trans. Multimed. 2018, 21, 1035–1046.
  10. Drouard, V.; Ba, S.; Evangelidis, G.; Deleforgr, A.; Horaud, R. Head pose estimation via probabilistic high-dimensional regression. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 4624–4628.
  11. Patacchiola, M.; Cangelosi, A. Head pose estimation in the wild using convolutional neural networks and adaptive gradient methods. Pattern Recognit. 2017, 71, 132–143.
  12. Ruiz, N.; Chong, E.; Rehg, J.M. Fine-grained head pose estimation without keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2074–2083.
  13. Huang, B.; Chen, R.; Xu, W.; Zhou, Q. Improving head pose estimation using two-stage ensembles with top-k regression. Image Vis. Comput. 2020, 93, 103827.
  14. Mercat, J.; Gilles, T.; El Zoghby, N.; Sandou, G. Multi-head attention for multi-modal joint vehicle motion forecasting. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–4 June 2020; pp. 9638–9644.
  15. Liu, H.; Fang, S.; Zhang, Z.; Li, D.; Lin, K. MFDNet: Collaborative poses perception and matrix Fisher distribution for head pose estimation. IEEE Trans. Multimed. 2021, 24, 2449–2460.
  16. Cao, Z.; Chu, Z.; Liu, D.; Chen, Y. A vector-based representation to enhance head pose estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Online, 5–9 January 2021; pp. 1188–1197.
  17. Jha, S.; Busso, C. Estimation of Driver’s Gaze Region from Head Position and Orientation Using Probabilistic Confidence Regions. IEEE Trans. Intell. Veh. 2022, 8, 59–72.
  18. Liu, H.; Liu, T.; Zhang, Z.; Arun Kumar, S.; Yang, B.; Li, Y. ARHPE: Asymmetric relation-aware representation learning for head pose estimation in industrial human–computer interaction. IEEE Trans. Ind. Inform. 2022, 18, 7107–7117.
  19. Wang, Q.; Lei, H.; Ma, X.; Xiao, S.; Wang, X. CNN Network for Head Detection with Depth Images in cyber-physical systems. In Proceedings of the 2020 International Conferences on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics), Rhodes, Greece, 2–6 November 2020; pp. 544–549.
  20. Wang, Q.; Lei, H.; Wang, X. A Siamese Network for Face Verification with Depth Images. In Proceedings of the 2021 International Conference on Intelligent Technology and Embedded Systems (ICITES), Chengdu, China, 31 October–2 November 2021; pp. 138–143.
  21. Wang, Q.; Lei, H.; Wang, X. Deep face verification under posture interference. J. Comput. Appl. 2022, 43, 595–600.
  22. Xu, H.; Bazavan, E.G.; Zanfir, A.; Freeman, W.T.; Sukthankar, R.; Sminchisescu, C. Ghum & ghuml: Generative 3d human shape and articulated pose models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6184–6193.
  23. Xiao, S.; Sang, N.; Wang, X.; Ma, X. Leveraging Ordinal Regression with Soft Labels for 3D Head Pose Estimation from Point Sets. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1883–1887.
  24. Hu, T.; Jha, S.; Busso, C. Temporal head pose estimation from point cloud in naturalistic driving conditions. IEEE Trans. Intell. Transp. Syst. 2021, 23, 8063–8076.
  25. Zou, H.; Sun, X. 3D Face Recognition Based on an Attention Mechanism and Sparse Loss Function. J. Electron. 2021, 10, 2539.
  26. Ma, X.; Sang, N.; Xiao, S.; Wang, X. Learning a Deep Regression Forest for Head Pose Estimation from a Single Depth Image. J. Circuits Syst. Comput. 2021, 30, 2150139.
  27. Cao, Y.; Liu, S. RP-Net: A PointNet++ 3D face recognition algorithm integrating RoPS local descriptor. IEEE Access 2022, 10, 91245–91252.
  28. Xu, Y.; Jung, C.; Chang, Y. Head pose estimation using deep neural networks and 3D point clouds. Pattern Recognit. 2022, 121, 108210.
  29. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a “siamese” time delay neural network. In Proceedings of the Conference and Workshop on Neural Information Processing Systems (NIPS), Denver, CO, USA, 7–11 December 1994; pp. 737–744.
  30. Melekhov, I.; Kannala, J.; Rahtu, E. Siamese network features for image matching. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 378–383.
  31. Varga, D.; Szirányi, T. Person re-identification based on deep multi-instance learning. In Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017; pp. 1559–1563.
  32. Zhang, Y.; Wang, L.; Qi, J.; Wang, D.; Feng, M.; Liu, H. Structured siamese network for real-time visual tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 351–366.
  33. Wang, X.; Fan, H.; Tian, Y.; Kihara, D.; Chen, X. On the importance of asymmetry for siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16570–16579.
  34. Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. Pointcnn: Convolution on x-transformed points. In Proceedings of the Conference and Workshop on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 2–8 December 2018; pp. 820–830.
  35. Guerrero, P.; Kleiman, Y.; Ovsjanikov, M.; Mitra, N.J. Pcpnet learning local shape properties from raw point clouds. In Computer Graphics Forum; Wiley: Hoboken, NJ, USA, 2018; Volume 37, pp. 75–85.
  36. Ju, Y.; Peng, Y.; Jian, M.; Gao, F.; Dong, J. Learning conditional photometric stereo with high-resolution features. Comput. Vis. Media 2022, 8, 105–118.
  37. Chen, G.; Han, K.; Wong, K.Y.K. PS-FCN: A flexible learning framework for photometric stereo. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–18.
  38. Venturelli, M.; Borghi, G.; Vezzani, R.; Cucchiara, R. From depth data to head pose estimation: A siamese approach. arXiv 2017, arXiv:1703.03624.
  39. Fanelli, G.; Gall, J.; Van Gool, L. Real time head pose estimation with random regression forests. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 21–23 June 2011; pp. 617–624.
Figure 1. The structure of PointNet for extracting features of the point object. The MLP consists of three convolutional layers with filters 64, 128, and 256. The convolution kernel is 1 × 1 .
Figure 2. Schematic diagram of spatial geometric characteristics with center point p i in a local region.
Figure 3. The structure of the head pose prediction network with local feature descriptors. In the figure, 3 fully connected layers are used to map head features to pose angles, where the last layer has 3 filters, which represent 3 pose angles (roll, pitch, and yaw).
Figure 4. The structure of the Siamese network for head pose estimations. Two shared weight networks extract similar pose objects, and an energy function (the loss function of the Siamese network) concatenates the prediction results of two branches to constrain the prediction values and guide the regression process.
Figure 5. Examples of Biwi Kinect Head Pose dataset (a) and Pandora dataset (b). The first line is the RGB images, and the second and third lines are the corresponding depth maps and point clouds, respectively.
Figure 6. (a–c) show the curves of prediction accuracy with different metrics θ for roll, pitch, and yaw, respectively.
Figure 7. Curves of accuracy and loss when training our network at γ = 15 .
Figure 8. Examples of different input numbers of points, where (a,b) are corresponding RGB images and depth maps, and (ce) represent input point clouds with 4096, 2048, and 1024 points, respectively.
Figure 9. Results of the Biwi dataset: (a) reports the comparison between the ground truth and the predicted value for each frame (ground truth is the black line). (b) reports the error distributions for each angle.
Figure 10. Examples on the Pandora dataset. (ad) are different objects with variable head poses. The first rows show the RGB images and the corresponding depth maps of head regions. The second rows show the point clouds of the objects and the pose prediction results, where the red arrows are the ground truth, and the dark blue arrows are the prediction values.
Table 1. Performance evaluation with different local region expressions on the Biwi Kinect Head Pose dataset.
Local Region | Position | Local Feature Descriptor
Roll | 2.2 ± 2.6 | 1.7 ± 2.0
Pitch | 2.4 ± 2.1 | 2.0 ± 2.2
Yaw | 2.4 ± 2.2 | 2.4 ± 2.1
Avg | 2.3 ± 2.3 | 2.0 ± 2.1
fps | 385 | 288
Table 2. Performance evaluation with different γ on the Biwi Kinect Head Pose dataset.
γ | Roll | Pitch | Yaw | Avg
0 | 1.7 ± 2.0 | 2.0 ± 2.2 | 2.4 ± 2.1 | 2.0 ± 2.1
5 | 1.7 ± 2.0 | 1.9 ± 2.1 | 2.2 ± 2.1 | 2.0 ± 2.1
10 | 1.5 ± 1.9 | 1.7 ± 2.0 | 2.2 ± 1.9 | 1.8 ± 1.9
15 | 1.3 ± 1.7 | 1.5 ± 1.8 | 2.2 ± 1.7 | 1.6 ± 1.7
20 | 1.3 ± 1.7 | 1.6 ± 1.8 | 2.3 ± 1.8 | 1.7 ± 1.8
25 | 1.4 ± 1.7 | 1.7 ± 2.0 | 2.3 ± 1.8 | 1.8 ± 1.8
30 | 1.5 ± 1.8 | 1.8 ± 2.0 | 2.4 ± 1.9 | 1.9 ± 1.9
35 | 1.6 ± 1.9 | 1.9 ± 2.2 | 2.4 ± 2.1 | 2.0 ± 2.0
40 | 1.9 ± 2.2 | 2.2 ± 2.3 | 2.4 ± 2.2 | 2.2 ± 2.2
45 | 2.3 ± 2.4 | 2.4 ± 2.5 | 2.5 ± 2.3 | 2.4 ± 2.4
50 | 2.5 ± 2.8 | 2.6 ± 2.7 | 2.5 ± 2.5 | 2.5 ± 2.7
Table 3. Results of the different input numbers of points on the Biwi Kinect Head Pose dataset.
Input Number | Accuracy (MAE ± SD) | fps
4096 | 1.6 ± 1.7 | 288
2048 | 1.7 ± 1.9 | 398
1024 | 2.0 ± 2.2 | 558
Table 4. Comparison of results achieved by different methods on the Biwi Kinect Head Pose dataset.
Methods | Input | Roll | Pitch | Yaw | Avg
Venturelli et al. [38] | Depth | 2.1 ± 2.2 | 2.3 ± 2.7 | 2.8 ± 3.3 | 2.4 ± 2.7
Borghi et al. [1] | Depth | 1.8 ± 1.8 | 1.6 ± 1.7 | 1.7 ± 1.5 | 1.7 ± 1.7
Xiao et al. [2] | Point cloud | 1.5 ± 1.4 | 2.3 ± 1.7 | 2.4 ± 1.8 | 2.1 ± 1.6
Huang et al. [13] | RGB | 3.1 | 5.2 | 4.6 | 4.3
Ma et al. [26] | Point cloud | 1.4 ± 2.0 | 1.5 ± 2.3 | 1.5 ± 2.1 | 1.5 ± 2.1
Cao et al. [16] | RGB | 4.1 | 4.8 | 3.0 | 4.0
Liu et al. [18] | RGB | 2.6 | 4.7 | 3.4 | 3.6
Ours | Point cloud | 1.3 ± 1.7 | 1.5 ± 1.8 | 2.2 ± 1.7 | 1.6 ± 1.7
Table 5. Comparison of results achieved by different methods on the Pandora dataset.
Methods | Input | Roll | Pitch | Yaw | Avg
Borghi et al. [1] | Depth | 5.4 ± 5.1 | 6.5 ± 6.6 | 10.4 ± 11.8 | 7.4 ± 7.8
Xiao et al. [23] | Point cloud | 4.3 ± 4.5 | 6.1 ± 5.6 | 8.6 ± 9.8 | 6.3 ± 6.6
Ma et al. [26] | Point cloud | 4.9 ± 7.4 | 6.4 ± 10.5 | 9.6 ± 15.3 | 7.0 ± 11.0
Ours | Point cloud | 4.3 ± 4.7 | 6.0 ± 5.2 | 8.3 ± 9.8 | 6.2 ± 6.6
Table 6. Comparison of different methods in terms of time costs.
Methods | fps
Xiao et al., 2020 [2] | 125
Xiao et al., 2020 [23] | 117
Wang et al., 2022 [21] | 148
Wang et al., 2023 [5] | 225
Ours | 288