3.1. Data Pre-Processing
To reduce the noise caused by the irrelevant background and by motion artifacts in the videos, we use the face as the region of interest (ROI) for all subsequent computation. The MTCNN face detection algorithm [26] is used to detect the face in each video of the dataset, as shown in Figure 5. To improve the processing speed of face detection, the face position coordinates detected in the first frame are stored and reused for the next 10 frames; the detector is then run again to update the coordinates, and so on.
Since the ground truth PPG signals also contain some noise, to ease training, they are processed with a band-pass finite impulse response (FIR) filter with a pass-band of 0.5–4 Hz, which covers the human heart rate range of 0.5–2.5 Hz. This not only retains the second harmonic and the signal information of the original signal but also removes noise and smooths the curve. To improve training efficiency, we further normalize the filtered signals. Because the videos in the dataset used in this paper are all 30 FPS, we also downsample the PPG signals to 30 Hz to ensure a one-to-one correspondence between the ground truth PPG samples and the video frames.
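The ground-truth processing maps directly onto standard SciPy calls; the sketch below assumes a PPG trace sampled at fs Hz and an arbitrary filter order (numtaps), which the paper does not specify.

import numpy as np
from scipy.signal import firwin, filtfilt, resample

def preprocess_ppg(ppg, fs, target_fs=30, numtaps=129):
    """Band-pass (0.5-4 Hz), normalize, and resample a PPG trace."""
    ppg = np.asarray(ppg, dtype=float)
    # FIR band-pass covering the heart-rate band and its second harmonic.
    taps = firwin(numtaps, [0.5, 4.0], pass_zero=False, fs=fs)
    filtered = filtfilt(taps, 1.0, ppg)  # zero-phase filtering
    # Zero-mean, unit-variance normalization.
    normalized = (filtered - filtered.mean()) / (filtered.std() + 1e-8)
    # Downsample so each sample aligns with one 30 FPS video frame.
    n_out = int(round(len(normalized) * target_fs / fs))
    return resample(normalized, n_out)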
3.2. 3D Depth-Wise Separable Convolution
The calculation process of standard 3D convolution is that the convolution kernels extract features from the input feature maps and then combine the extracted features into new feature maps for output. As shown in Figure 6, the size of the input feature maps is $D_F \times D_F \times T \times M$, where $D_F$ is the length and width of the input feature maps, $T$ is the temporal dimension, and $M$ is the number of input channels. The size of the output feature maps is $D_G \times D_G \times T \times N$, where $D_G$ is the length and width of the output feature maps, $T$ is the temporal dimension, and $N$ is the number of output channels. The size of each filter is $D_K \times D_K \times D_K \times M$; that is, each filter is composed of $M$ convolution kernels whose length, width, and temporal dimension are all $D_K$. In standard 3D convolution, the number of filters equals the number of output channels, and the number of convolution kernels composing each filter equals the number of input channels.

The multiplication count of standard 3D convolution is therefore:

$D_K \times D_K \times D_K \times M \times N \times D_G \times D_G \times T$
The main difference between 3D depth-wise separable convolution and its 2D counterpart is that the 3D version has an additional temporal dimension. 3D depth-wise separable convolution decomposes the standard 3D convolution into two stages, depth-wise convolution (DWC) and point-wise convolution (PWC), as shown in Figure 7 and Figure 8.
In depth-wise convolution, each channel of the input feature maps is treated as a separate feature map, and each is convolved with a filter composed of only one convolution kernel, yielding the same number of output feature maps as input channels; the number of filters thus equals the number of output channels. The final output is obtained by stacking these feature maps, and the multiplication count is:

$D_K \times D_K \times D_K \times M \times D_G \times D_G \times T$
In the above process, since the number of input channels equals the number of output channels, the number of feature maps cannot be expanded. Moreover, this operation convolves each channel of the input independently, so it does not exploit the feature information of different channels at the same spatiotemporal position. To solve these problems, the following point-wise convolution is needed to combine the features.
In point-wise convolution, filters composed of multiple $1 \times 1 \times 1$ convolution kernels perform further feature extraction on the feature maps generated by depth-wise convolution. The process is very similar to standard 3D convolution, except that the convolution kernels composing the filters all have size $1 \times 1 \times 1$. After point-wise convolution, the feature maps generated in the previous step are weighted and fused along the depth (channel) direction to obtain the final output feature maps of the depth-wise separable convolution. The multiplication count of 3D point-wise convolution is:

$M \times N \times D_G \times D_G \times T$
Adding the multiplication counts of depth-wise convolution and point-wise convolution gives the total for 3D depth-wise separable convolution:

$D_K \times D_K \times D_K \times M \times D_G \times D_G \times T + M \times N \times D_G \times D_G \times T$
The ratio of the multiplication count of 3D depth-wise separable convolution to that of standard 3D convolution is:

$\frac{D_K \times D_K \times D_K \times M \times D_G \times D_G \times T + M \times N \times D_G \times D_G \times T}{D_K \times D_K \times D_K \times M \times N \times D_G \times D_G \times T} = \frac{1}{N} + \frac{1}{D_K^3}$
From the above calculation, it can be seen that the computation of 3D convolution is concentrated in the feature extraction step. By decomposing the convolution into depth-wise convolution (per-channel feature extraction) and point-wise convolution (feature merging), 3D depth-wise separable convolution greatly reduces the computation of the model while having only a small impact on network accuracy. For example, with a typical $3 \times 3 \times 3$ kernel ($D_K = 3$) and $N = 64$ output channels, the ratio is $1/64 + 1/27 \approx 0.053$, i.e., roughly a 19-fold reduction in multiplications. It is precisely because of this advantage that depth-wise separable convolution is used as the basic convolution module in ESA-rPPGNet.
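As a concrete sketch, the DWC + PWC factorization can be written in PyTorch as follows; the grouped 3D convolution performs the per-channel feature extraction and the $1 \times 1 \times 1$ convolution performs the channel fusion. The channel counts are illustrative, not the paper's configuration.

import torch
import torch.nn as nn

class DepthwiseSeparableConv3d(nn.Module):
    """3D depth-wise separable convolution: DWC followed by PWC."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # DWC: groups=in_channels gives one D_K x D_K x D_K kernel per channel.
        self.dwc = nn.Conv3d(in_channels, in_channels, kernel_size,
                             padding=kernel_size // 2, groups=in_channels)
        # PWC: 1x1x1 kernels fuse channels and set the output width.
        self.pwc = nn.Conv3d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):                 # x: (B, M, T, H, W)
        return self.pwc(self.dwc(x))      # -> (B, N, T, H, W)

# Same output shape as a standard 3D convolution, far fewer multiplications.
x = torch.randn(1, 16, 8, 32, 32)
print(DepthwiseSeparableConv3d(16, 64)(x).shape)  # torch.Size([1, 64, 8, 32, 32])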
3.3. 3D Shuffle Attention Block
The attention mechanism first appeared in the fields of natural language processing and machine translation, where it achieved good results. Through the explorations of various researchers, it has gradually been applied in the field of computer vision as well. The attention mechanism in deep learning is similar to that of human vision: it focuses on the important points in a large amount of information, selects the key information, and ignores the rest. Adding an attention mechanism to a convolutional neural network can effectively improve the performance of the network.
The attention mechanism in computer vision is mainly divided into two categories, spatial attention and channel attention, which capture pixel-level and channel-level relationships, respectively. For 2DCNN, there are already many outstanding attention modules, such as the SE block [27], based on channel attention, and CBAM [28], which combines channel and spatial attention. However, each has its own shortcomings. The SE block uses only channel attention and does not make full use of the correlation between spatial and channel attention. Although CBAM combines channel and spatial attention, it inevitably increases the computation and size of the model.
To solve these problems, shuffle attention (SA) [29] first obtains multiple sub-features by grouping the channel dimension. SA then applies the channel and spatial attention mechanisms to each sub-feature simultaneously. Finally, SA aggregates all the sub-features and uses the channel shuffle operation [30] to fuse the sub-features of different groups. Based on the SA module, this paper constructs a 3D shuffle attention (3D-SA) module for 3DCNN, as shown in Figure 9.
Consider the input feature map $X \in \mathbb{R}^{C \times T \times H \times W}$, where $C$, $T$, $H$, $W$ indicate the number of channels, the temporal dimension, and the spatial height and width, respectively. First, the feature map $X$ is divided into $G$ groups along the channel dimension, i.e., $X = [X_1, \dots, X_G]$ with $X_k \in \mathbb{R}^{C/G \times T \times H \times W}$. Each sub-feature $X_k$ gradually captures a specific semantic response during training. Each $X_k$ is then divided into two branches along the channel dimension, i.e., $X_{k1}, X_{k2} \in \mathbb{R}^{C/2G \times T \times H \times W}$. These two branches use channel attention and spatial attention, respectively, to generate channel attention maps and spatial attention maps. In this way, the model learns both what to pay attention to and where it is meaningful to attend.
To keep the channel attention branch lightweight, a light implementation [29] is used instead of the SE block directly. To embed global information, global average pooling (GAP) is used to generate the channel-wise global statistic $s \in \mathbb{R}^{C/2G \times 1 \times 1 \times 1}$, which shrinks the temporal and spatial dimensions $T \times H \times W$ of $X_{k1}$:

$s = \mathcal{F}_{gp}(X_{k1}) = \frac{1}{T \cdot H \cdot W} \sum_{t=1}^{T} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{k1}(:, t, i, j)$
In addition, a simple gating mechanism implemented with the sigmoid activation function creates a compact feature that guides precise and adaptive channel selection. The final output of the channel attention branch is then obtained by

$X'_{k1} = \sigma(W_1 \cdot s + b_1) \cdot X_{k1}$

where $W_1 \in \mathbb{R}^{C/2G \times 1 \times 1 \times 1}$ and $b_1 \in \mathbb{R}^{C/2G \times 1 \times 1 \times 1}$ are parameters used to scale and shift $s$.
Unlike channel attention, which pays more attention to “what”, spatial attention pays more attention to “where”; the information it obtains complements that obtained by channel attention. In the spatial attention branch, $X_{k2}$ is first group normalized (GroupNorm, GN) [31] to obtain spatial-level statistics [29]; then $W_2 \cdot GN(X_{k2}) + b_2$ is computed to enhance the spatial features, and finally a simple gating mechanism is created with the sigmoid activation function. The final output of the spatial attention branch is obtained by

$X'_{k2} = \sigma(W_2 \cdot GN(X_{k2}) + b_2) \cdot X_{k2}$

where $W_2 \in \mathbb{R}^{C/2G \times 1 \times 1 \times 1}$ and $b_2 \in \mathbb{R}^{C/2G \times 1 \times 1 \times 1}$.
After obtaining the outputs of the channel attention and spatial attention branches, the two outputs are concatenated along the channel dimension so that the number of channels equals that of the input, i.e., $X'_k = [X'_{k1}, X'_{k2}] \in \mathbb{R}^{C/G \times T \times H \times W}$. When all the sub-features have been aggregated, a channel shuffle operation similar to that of ShuffleNet v2 [30] is used to let information flow across groups along the channel dimension, finally yielding the output of the 3D-SA module. Since the output dimensions of the module are the same as its input dimensions, the 3D-SA block can easily be inserted into 3DCNN models.
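The block below sketches 3D-SA as described by the formulas above (grouping, the two gated branches, concatenation, and channel shuffle). The parameter shapes follow the SA design [29] extended with a temporal axis, but it should be read as an illustration under the assumption that channels is divisible by 2 x groups, not as the authors' exact code.

import torch
import torch.nn as nn

class ShuffleAttention3d(nn.Module):
    """3D shuffle attention: grouped channel + spatial attention."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)          # per-branch channels, C/2G
        # Scale/shift parameters W1, b1 (channel) and W2, b2 (spatial).
        self.w1 = nn.Parameter(torch.ones(1, c, 1, 1, 1))
        self.b1 = nn.Parameter(torch.zeros(1, c, 1, 1, 1))
        self.w2 = nn.Parameter(torch.ones(1, c, 1, 1, 1))
        self.b2 = nn.Parameter(torch.zeros(1, c, 1, 1, 1))
        self.gn = nn.GroupNorm(c, c)          # spatial-level statistics
        self.gap = nn.AdaptiveAvgPool3d(1)    # channel-wise global statistics

    def forward(self, x):                     # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        x = x.reshape(b * self.groups, c // self.groups, t, h, w)
        x1, x2 = x.chunk(2, dim=1)            # split into the two branches
        # Channel attention: GAP -> scale/shift -> sigmoid gate.
        x1 = x1 * torch.sigmoid(self.w1 * self.gap(x1) + self.b1)
        # Spatial attention: GroupNorm -> scale/shift -> sigmoid gate.
        x2 = x2 * torch.sigmoid(self.w2 * self.gn(x2) + self.b2)
        out = torch.cat([x1, x2], dim=1).reshape(b, c, t, h, w)
        # Channel shuffle (groups of 2) to mix information across groups.
        out = out.reshape(b, 2, c // 2, t, h, w).transpose(1, 2)
        return out.reshape(b, c, t, h, w)     # same shape as the input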
In the non-contact rPPG extraction task, the network must learn the subtle changes in facial skin color caused by the pulse. By introducing an attention mechanism, our network can focus on these skin color changes and ignore the noise introduced by the eyes, mouth, and other regions. As a result, the effectiveness and robustness of the network are improved.
3.4. Recurrent Neural Network ConvGRU
A recurrent neural network (RNN) is a special neural network structure that differs from an ordinary 2DCNN: it not only considers the input at the current moment but also gives the network the ability to remember previous content. This makes RNNs very effective for time-series problems. However, RNNs also have serious shortcomings, such as vanishing and exploding gradients. To solve these problems, a series of improved architectures have appeared, of which two are dominant: long short-term memory (LSTM) and the gated recurrent unit (GRU).
As a variant of LSTM, the GRU combines the forget gate and the input gate into a single update gate. Compared with LSTM, it has similar performance but requires less memory and is easier to train [32]. The GRU allows each recurrent unit to adaptively capture dependencies at different time scales, and it is computed as follows:

$z_t = \sigma(W_z x_t + U_z h_{t-1})$
$r_t = \sigma(W_r x_t + U_r h_{t-1})$
$\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))$
$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$

where $\odot$ is element-wise multiplication and $\sigma$ is the sigmoid activation function. $z_t$ is the update gate, which controls the degree to which the state information of the previous time step is carried into the current state; the larger its value, the more previous state information is brought in. $r_t$ is the reset gate, which controls the degree to which the state information of the previous time step is ignored; the smaller its value, the more it is ignored. $\tilde{h}_t$ is the candidate hidden state, similar to $\tilde{c}_t$ in LSTM, and can be regarded as the new information at the current time step.
However, for a video task, the input convolutional feature map is a three-dimensional tensor (two spatial dimensions plus channels), so directly using a GRU would generate a large number of parameters. In [33], the ConvLSTM model was proposed, which replaces the fully connected layers in LSTM with convolutions. In this way, it captures spatial features through convolution operations on multi-dimensional data and avoids the parameter explosion of applying LSTM directly. ConvGRU [34] can be implemented by modifying ConvLSTM, converting the LSTM into a GRU. It is computed as follows:

$z_t = \sigma(W_z * x_t + U_z * h_{t-1})$
$r_t = \sigma(W_r * x_t + U_r * h_{t-1})$
$\tilde{h}_t = \tanh(W * x_t + U * (r_t \odot h_{t-1}))$
$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$

where $*$ is the convolution operation, and $W_z$, $W_r$, $W$, $U_z$, $U_r$, $U$ are all 2D convolution kernels. To ensure that the spatial size of the hidden representation remains fixed over time, zero padding is used in the recurrent convolutions.
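A minimal ConvGRU cell along the lines of these equations is sketched below. For compactness the six kernels are fused into two convolutions over concatenated inputs (an implementation choice, mathematically equivalent to separate W and U kernels); the kernel size is an assumption.

import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """ConvGRU cell: the GRU's fully connected ops replaced by 2D convolutions."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2  # zero padding keeps the spatial size fixed over time
        # One convolution computes both gates z_t and r_t from [x_t, h_{t-1}].
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)
        # Candidate state h~_t from [x_t, r_t * h_{t-1}].
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)

    def forward(self, x, h):       # x: (B, Cin, H, W), h: (B, Ch, H, W)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        # A larger z_t keeps more of the previous state, matching the text.
        return z * h + (1 - z) * h_tilde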
A good temporal context can effectively guide the network to learn the changes in facial skin color caused by the pulse and to suppress the noise caused by motion artifacts and illumination changes. Although 3DCNN learns short-term temporal context well, its ability to learn long-term temporal context is relatively weak. To improve the network's ability to learn long-term temporal context and make full use of temporal information, ConvGRU is used at the end of ESA-rPPGNet to further process the temporal features extracted by the feature extraction network.
3.5. ESA-rPPGNet
The overall network structure of ESA-rPPGNet is shown in Figure 10 (detailed parameter settings can be found in Table 1). In ESA-rPPGNet, the ESA network is mainly used to accurately extract the spatiotemporal features of the input video from which the rPPG signal is recovered. It is constructed with the MobileNet v3 structure [35] as a reference. The basic blocks of the network are shown in Figure 11.
An encoder-decoder structure [36] is used in the ESA network. The temporal dimension is first compressed in the front part of the network, and the DCBlock at the end of the network then recovers the temporal dimension to its original length. Through this encoder-decoder structure, semantic features with less temporal redundancy can be extracted.
Eblock, the basic building module of the encoder, uses the inverted residual structure proposed in MobileNet v2 [37]. In an inverted residual, the number of channels is first expanded by a PWC; features are then extracted by a DWC; finally, the resulting feature maps are weighted and fused in the depth direction by another PWC, which compresses the number of channels again. The lower the dimension of a tensor, the fewer multiplications its convolutional layer requires, which improves overall speed but may reduce network accuracy; conversely, if every convolutional layer extracted features from low-dimensional tensors, the network as a whole could not extract enough information. By first expanding and then compressing the number of channels, the quality of the network and the amount of computation reach a balance.
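An Eblock along these lines might look as follows in PyTorch; the expansion factor, kernel size, and skip-connection rule are illustrative assumptions, and the commented slot marks where the 3D-SA block described in Section 3.3 is inserted.

import torch.nn as nn

class EBlock(nn.Module):
    """Inverted residual (MobileNet v2 style) built from 3D convolutions."""
    def __init__(self, in_ch, out_ch, expand=4, k=3):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, mid, kernel_size=1),                 # PWC: expand
            nn.ReLU6(inplace=True),
            nn.Conv3d(mid, mid, k, padding=k // 2, groups=mid),   # DWC: extract
            # (the 3D-SA attention block sits here in ESA-rPPGNet)
            nn.ReLU6(inplace=True),
            nn.Conv3d(mid, out_ch, kernel_size=1),                # PWC: project
        )
        self.use_skip = in_ch == out_ch  # residual only when shapes match

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out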
Regarding the nonlinear layers used in the network, it was shown in [38] that a nonlinearity called swish can significantly improve network accuracy. It is defined as:

$\text{swish}(x) = x \cdot \sigma(x)$

where $\sigma$ is the sigmoid activation function.
Although this nonlinearity improves accuracy, computing the sigmoid function on mobile devices is expensive. For this reason, MobileNet v3 proposes two remedies:
Firstly, $\frac{\text{ReLU6}(x + 3)}{6}$ is used to replace the sigmoid function, so that the hardware-friendly version of swish becomes:

$\text{h-swish}(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6}$
Secondly, because the cost of applying the swish nonlinearity decreases as the network deepens, using swish only in the deeper layers of the network yields better results.
According to the above two points, the front part of the ESA network uses ReLU6, and the latter part uses h-swish as the nonlinear layer.
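Both nonlinearities are one-liners; recent PyTorch versions also ship them as built-in modules (nn.SiLU, nn.Hardswish), but writing them out makes the substitution explicit.

import torch
import torch.nn.functional as F

def swish(x):
    return x * torch.sigmoid(x)       # swish(x) = x * sigmoid(x)

def h_swish(x):
    return x * F.relu6(x + 3) / 6     # hardware-friendly approximation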
A lightweight 3D-SA attention block is placed between the DWC and the PWC in each Eblock. The 3D-SA module enables the network to focus on important information and learn it fully, improving the accuracy of the network without adding much complexity.
Through the feature extraction network ESA, rich temporal and spatial features are obtained. To further strengthen the network's long-term spatiotemporal feature learning ability, ESA-rPPGNet introduces the ConvGRU module [34]. Three ConvGRU layers are used at the end of the network; the outputs of the layers are fused, and the spatial dimensions are then pooled by a 3D adaptive average pooling layer. Since ESA-rPPGNet is designed as a fully convolutional structure, the rPPG signal is finally obtained by a 3D convolution with a kernel size of 1 × 1 × 1.
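The tail of the network can be sketched as follows, reusing the ConvGRUCell from Section 3.4; the fusion by summation, the zero initial hidden state, and the channel count are assumptions where Table 1 would give the exact settings.

import torch
import torch.nn as nn

class RPPGHead(nn.Module):
    """Three stacked ConvGRU layers, spatial pooling, and a 1x1x1 conv head."""
    def __init__(self, ch):
        super().__init__()
        self.grus = nn.ModuleList(ConvGRUCell(ch, ch) for _ in range(3))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # pool space, keep T
        self.head = nn.Conv3d(ch, 1, kernel_size=1)     # 1x1x1 convolution

    def forward(self, feats):          # feats: (B, C, T, H, W)
        b, c, t, h, w = feats.shape
        x, fused = feats, 0
        for gru in self.grus:
            hid = torch.zeros(b, c, h, w, device=feats.device)
            outs = []
            for i in range(t):         # step the cell through time
                hid = gru(x[:, :, i], hid)
                outs.append(hid)
            x = torch.stack(outs, dim=2)   # (B, C, T, H, W)
            fused = fused + x              # fuse the three layer outputs (summation assumed)
        y = self.head(self.pool(fused))    # (B, 1, T, 1, 1)
        return y.flatten(1)                # rPPG signal of length T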