Three-Stream Convolutional Neural Network with Squeeze-and-Excitation Block for Near-Infrared Facial Expression Recognition

Abstract: Near-infrared (NIR) facial expression recognition is resistant to illumination change. In this paper, we propose a three-stream three-dimensional convolution neural network with a squeeze-and-excitation (SE) block for NIR facial expression recognition. We fed each stream with different local regions, namely the eyes, nose, and mouth. By using an SE block, the network automatically allocated weights to different local features to further improve recognition accuracy. The experimental results on the Oulu-CASIA NIR facial expression database showed that the proposed method has a higher recognition rate than some state-of-the-art algorithms.


Introduction
Facial expressions carry rich non-verbal information. Machines with the ability to understand facial expressions can better serve humans and fundamentally change the relationship between humans and machines. Therefore, automatic facial expression recognition has attracted attention from many fields, such as virtual reality [1,2], public security [3,4], and data-driven animation [5,6].
The effectiveness of facial expression recognition can be easily affected by environmental changes, such as changes of light, angle, and distance. Among these, the change of illumination conditions under visible light (VIS) (380-750 nm) has the largest influence [7,8]. To overcome this influence, an active near-infrared (NIR) illumination source (780-1100 nm) is used for the recognition. In this study, an NIR camera, together with the NIR illumination sources, was placed in front of the subjects. The intensity of the NIR illumination source was much higher than that of the ambient NIR light in indoor environments. Therefore, the ambient illumination problem could be solved as long as the active NIR illumination source is constant. The NIR recognition system is resistant to ambient illumination variations and has been successfully applied to the field of face recognition [9]; it can perform well even in dark environments [10], in which normal imaging systems fail to perform recognition.
Facial expressions manifest themselves as movements of one or several discrete parts of the face, such as tightening the lips to express anger and raising the mouth to express happiness [11]. Some researchers use the features extracted from the entire face, which are called global features [12,13], for recognition, while other researchers use features extracted from specific parts, which are called local features [14][15][16][17]. Many researchers have demonstrated that local features improve the performance of facial expression recognition compared with global features [18,19]. The main reason for this advancement is that the specific local regions contribute more accurate information about facial changes that helps to distinguish the expressions, while the global region contains more identity information. Some researchers [20,21] have pointed out that the eyes, eyebrows, and mouth are the most expressive facial parts. However, it is unknown which part of the face should carry more weight in expression recognition or how the correct weight can be allocated to different parts of the face.
In earlier studies, many facial expression recognition systems used static images [22][23][24] that only contain spatial information as the input. However, facial expression can be a dynamic process, and the dynamic information of the face can better reflect the change of expression. Therefore, it is necessary to extract spatial and temporal information from the image sequences to facilitate recognition.
In the work reported in this paper, we designed a convolutional neural network (CNN) to complete NIR facial expression recognition. The CNN used is a three-stream three-dimensional (3D) CNN, which can learn spatio-temporal information from image sequences. In addition, the three inputs to the CNN are all local features, which not only reduce computational complexity, but also remove information not related to the expressions (such as identity information). A squeeze-and-excitation (SE) block is appended after the 3D CNN, which can automatically assign more weight to the local features that carry more expression information. To overcome the over-fitting problem caused by small data, features are extracted through three identical shallow networks. Finally, we add a global face stream to the local network, further increasing the recognition rate.
The main contributions of this paper are the following: (1) Three local regions of the face are used as the input of the network for the NIR expression recognition, which can not only accurately extract the facial expression information, but also reduce the computational complexity and dimensions; and (2) an SE block is added to model the dependencies between feature channels and adaptively learn the weight of the channel to gain efficient expression information and attenuate the useless information.

Related Work
Facial expressions can be decomposed into movement of one or more discrete facial action units (AUs). Inspired by this theory, Liu et al. [25] located common patches and unique patches of different expressions for recognition. However, this method could cause overlapping of located areas. Liu et al. [26] did further work and proposed a framework called FDM to select the active features of each expression without overlapping. Later, Liu et al. [27] proposed a 3D CNN with deformable action part constraints that can locate and code action units.
To extract temporal features while acquiring spatial features, Ji et al. [28] extended a CNN to a 3D CNN, which can extract the spatio-temporal information from image sequences. Szegedy et al. [29] utilized the 3D CNN to extract temporal information for video-based expression recognition. Chen et al. [30] proposed a new descriptor, the histogram of oriented gradients from three orthogonal planes (HOG-TOP), to extract the dynamic texture features from image sequences, which are fused with the geometric features to identify expressions. Fonnegra et al. [31] proposed a deep learning model and Yan et al. [32] presented collaborative-discriminative-multi-metric-learning (CDMML)-based image sequences for emotion recognition. To make the system more precise, Zia et al. [33] proposed a dynamic weight majority voting mechanism for the construction of ensemble systems. However, since these methods are all based on visible light, the impact of external illumination changes is not considered.
NIR facial images and videos are hardly influenced by changes in ambient visible light. Farokhi et al. [34] proposed a method of extracting global and local features by using Zernike moments (ZMs) and Hermite kernels (HKs), respectively, and then used the fused features to identify the NIR face. Taini et al. [35] assembled a near-infrared facial expression database and completed the first study based on NIR facial expression recognition. Zhao et al. [18] developed the database of NIR facial expressions, called the Oulu-CASIA NIR facial expression database, and used local binary patterns from three orthogonal planes (LBP-TOP) to extract dynamic local features. It was proved in this work that NIR can overcome the influence of visible-light illumination changes on expression recognition. However, these methods must extract facial expression features manually. Jeni et al. [36] proposed a 3D-shape-information-based recognition technique and further proved that an NIR camera configuration is suitable for facial expressions under light-changing conditions. Wu et al. [37] proposed a three-stream 3D convolutional network for NIR facial expression recognition, using a combination of global and local features, but did not consider assigning different weights to local features.

3D CNN
A 3D CNN is well suited to spatio-temporal feature extraction. In [28], to process image sequences more efficiently, a 3D CNN approach is proposed to address action recognition problems. Through 3D convolution and pooling operations, a 3D CNN has the ability to learn temporal features.
A 3D CNN consists of an input layer, 3D convolution, 3D pooling (usually, each convolution layer is followed by a pooling layer), and a fully connected (FC) layer. The dimension of the input image sequences to the 3D CNN is represented as d × l × h × w, where d is the number of channels, l the number of frames of the video clip, and h and w the height and width, respectively, of each frame. In addition, 3D convolution and pooling have a kernel size of t × k × k, where t is the temporal depth and k the spatial size.
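The d × l × h × w and t × k × k notation can be made concrete with a small shape calculation. The following is a minimal sketch rather than the authors' implementation: the helper name `conv3d_output_shape`, the stride and padding values, and the reading of a 36 × 64 local region as height × width are assumptions chosen for illustration. It uses the standard convolution size formula, floor((n + 2p − k) / s) + 1, along each of the three axes.

```python
def conv3d_output_shape(in_shape, num_kernels, kernel, stride=(1, 1, 1), padding=(0, 0, 0)):
    """Return the (d, l, h, w) shape produced by one 3D convolution layer.

    in_shape: (d, l, h, w) = (channels, frames, height, width)
    kernel:   (t, k, k)    = (temporal depth, spatial size, spatial size)
    stride and padding are given per axis in the order (l, h, w).
    """
    _, l, h, w = in_shape
    out = []
    for size, k, s, p in zip((l, h, w), kernel, stride, padding):
        # Standard output-size formula for a strided, zero-padded convolution
        out.append((size + 2 * p - k) // s + 1)
    # The number of kernels becomes the channel dimension of the output
    return (num_kernels, *out)


# Hypothetical input: one grayscale clip of 32 frames, each 36 x 64 pixels
print(conv3d_output_shape((1, 32, 36, 64), num_kernels=16, kernel=(3, 3, 3), padding=(1, 1, 1)))
print(conv3d_output_shape((1, 32, 36, 64), num_kernels=16, kernel=(3, 3, 3)))
```

With one pixel of padding per axis the 3 × 3 × 3 kernel preserves the frame count and spatial size; without padding each axis shrinks by two.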

Squeeze-and-Excitation Networks (SENets)
Hu et al. [38] proposed squeeze-and-excitation networks (SENets). The basic architectural unit of SENets is the SE building block, which is shown in Figure 1.
Before the SE block operation, input data $X$ is transformed into features $U$ through a series of convolution operations, i.e., $F_{tr}: X \to U$, with $X \in \mathbb{R}^{W' \times H' \times C'}$ and $U \in \mathbb{R}^{W \times H \times C}$, where $F_{tr}$ represents the transformation from $X$ to $U$, $H$ ($H'$) and $W$ ($W'$) are the frame height and width, respectively, and $C$ ($C'$) is the number of channels.
The SE block mainly consists of two operations: squeeze and excitation. Because the filter learned by each channel in the CNN operates on a local receptive field, each feature map in $U$ cannot utilize the context information of other feature maps. The purpose of the squeeze operation is to provide a global receptive field, so that the lower layers of the network can also use global information. Global average pooling is used to compress $U$ (multiple feature maps) into $Z$, so that the $C$ feature maps become a vector of real numbers of size 1 × 1 × $C$. The squeeze operation is performed by

$$z_m = F_{sq}(u_m) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_m(i, j),$$

where $z_m$ represents the $m$th element of $Z$ and $u_m$ the $m$th element (feature map) of $U$.
The excitation operation is a simple gating mechanism with a sigmoid activation. Its purpose is to model the interdependence between feature channels by learning parameters that generate the weight of each feature channel. To meet these requirements while limiting model complexity and aiding generalization, two FC layers (equivalently, 1 × 1 convolution layers) were introduced. The first is a dimension reduction layer with parameter $W_1$ and reduction ratio $r$, followed by a rectified linear unit (ReLU); the second is a dimension increase layer with parameter $W_2$, where $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$. The excitation is performed by

$$S = F_{ex}(Z, W) = \sigma(W_2 \, \delta(W_1 Z)),$$

where $S$ is the vector after the excitation operation, and $\delta$ and $\sigma$ refer to the ReLU function and the sigmoid function, respectively. Finally, $S$ is combined with $U$ to obtain the final output by

$$\tilde{x}_m = F_{scale}(u_m, s_m) = s_m \cdot u_m,$$

where $s_m$ is the $m$th element of $S$ and $\tilde{x}_m$ the $m$th element of the final output $\tilde{X}$; $F_{scale}$ refers to channel-wise multiplication.
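The squeeze, excitation, and scale steps can be traced in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' Caffe implementation: the function name `se_block`, the random weights, and the toy sizes (C = 8, r = 4, a 5 × 6 feature map) are illustrative, bias terms are omitted, and the features use the W × H × C layout of the definitions above rather than the 3D feature maps of the full network.

```python
import numpy as np


def se_block(U, W1, W2):
    """Apply squeeze-and-excitation to features U of shape (W, H, C).

    W1 has shape (C // r, C) and W2 has shape (C, C // r).
    Returns the recalibrated features and the channel weights S.
    """
    # Squeeze: global average pooling over the spatial axes gives Z of shape (C,)
    z = U.mean(axis=(0, 1))
    # Excitation: reduce with W1, apply ReLU, restore with W2, apply sigmoid
    hidden = np.maximum(W1 @ z, 0.0)
    s = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))
    # Scale: channel-wise multiplication, broadcast over the spatial axes
    return U * s, s


rng = np.random.default_rng(0)
C, r = 8, 4                                  # toy channel count and reduction ratio
U = rng.standard_normal((5, 6, C))           # hypothetical feature maps
W1 = rng.standard_normal((C // r, C))        # dimension reduction layer
W2 = rng.standard_normal((C, C // r))        # dimension increase layer
X_tilde, S = se_block(U, W1, W2)
print(X_tilde.shape, S.shape)
```

Because the sigmoid keeps every entry of S between 0 and 1, each channel of U is attenuated in proportion to how little the learned gating values it.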
The goal of the SE block is to improve the expressiveness of the network; it adaptively recalibrates the feature weights by modeling the interdependencies between the channels. In more detail, it allows the network to use global information to selectively enhance the beneficial feature channels and suppress the less useful ones.

Proposed System
In this paper, we propose a three-stream 3D CNN with an SE block called an SE three-stream fusion network (SETFNet). We took three local regions, the eyes (including eyebrows), nose, and mouth, from the facial expression image sequence as inputs to the three-stream network. After fusion of the three streams, an SE block was added to the network to adaptively learn the weight of each feature channel.
To avoid over-fitting problems, a deep CNN requires large amounts of data for training. However, the available database for NIR expression is small in size. To train a CNN model on a small database, researchers use a medium-size CNN [39,40]. Therefore, the SETFNet in this paper was also a medium-size CNN with four convolutional layers.
The structure of the proposed SETFNet is shown in Figure 2. It is a three-stream 3D CNN consisting of three identical sub-networks. Each sub-network consists of four convolutional layers and has the same parameters. The number of convolution kernels for the four convolution layers, first through fourth, is 16, 32, 64, and 128, respectively. The kernel size of the first convolution layer is 3 × 3 × 8, and a large temporal stride is used here to eliminate some useless information. The kernel size of the other three convolution layers is 3 × 3 × 3. The three streams were fused and followed by an SE block to recalibrate the weight of each stream. The details of each sub-network are shown in Table 1.

[Table 1: details of each sub-network; columns: Layers, Kernel Parameter Settings, Number of Kernels, Output Size]
Electronics 2019, 8, x FOR PEER REVIEW

Fusion Network
After extracting the features from the three regions (eyes, nose, and mouth), three stream features defined as $T_1$, $T_2$, and $T_3$ were obtained. The three stream features were then concatenated to achieve better recognition:

$$T = T_1 \oplus T_2 \oplus T_3,$$

where $T$ is the fused feature and $\oplus$ represents the concatenation operation. The concatenated features $T$ were used as inputs to the next operation of the network.
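As an illustration of the concatenation operation, the following sketch fuses three per-stream feature arrays with NumPy. The array shapes, the (channels, frames, height, width) layout, and the choice of the channel axis for concatenation are assumptions made for this example; the text above does not fix the axis, although channel-wise stacking is consistent with an SE block that reweights feature channels afterwards.

```python
import numpy as np

# Hypothetical outputs of the three sub-networks, layout (channels, frames, height, width).
# Constant values make it easy to see where each stream ends up after fusion.
T1 = np.zeros((128, 2, 3, 4))        # eye stream
T2 = np.ones((128, 2, 3, 4))         # nose stream
T3 = np.full((128, 2, 3, 4), 2.0)    # mouth stream

# T = T1 (+) T2 (+) T3: stack the streams along the channel axis (assumed)
T = np.concatenate([T1, T2, T3], axis=0)
print(T.shape)  # (384, 2, 3, 4)
```

Each stream keeps its own block of 128 channels in the fused tensor, so a subsequent channel-weighting step can scale the streams differently.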

Experiments
The proposed network was assessed on the Oulu-CASIA NIR facial expression database [18]. The network was implemented in the Caffe framework and ran on a PC with an NVIDIA GeForce GTX 1080 graphics processing unit (GPU) (8 GB). Training a model with the correct parameters is the key to achieving optimal performance and has a direct impact on the experimental results. We trained the network from scratch using a batch size of 4, an initial learning rate of $10^{-3}$, and a weight decay of 0.0005.

Database
Because NIR facial expression databases are uncommon, the Oulu-CASIA NIR facial expression database is currently the only suitable one. It was collected in dark, weak, and normal light conditions, and consists of six kinds of facial expressions (anger, disgust, fear, happiness, sadness, and surprise) of 80 people between 23 and 58 years old, so each illumination condition has 480 image sequences. All expression sequences begin at the neutral emotion and end with the peak of the emotion. Each subject was asked to sit on a chair in the observation room, facing the camera. The distance between the face and camera was approximately 60 cm. Subjects made expressions according to the image sequences, while videos were captured by a USB 2.0 PC Camera (SN9C 201 & 202). Each clip was filmed by the camera at a frame rate of 25 fps. The image resolution was 320 × 240.
The aforementioned database has been used in many studies of facial expression recognition. It has been proved that the identification task under dark illumination conditions is the most difficult [18], because the facial image loses most of the texture features in dark light conditions. Therefore, we tested the proposed network on this most difficult sub-dataset (dark illumination condition).
We used tenfold cross-validation. All of the image sequences were divided into 10 groups. At each fold, nine groups were used to train the network and the remaining group was used for testing. During the entire experiment, there was no overlap between the training and testing sets.
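The tenfold protocol can be sketched as follows. This is an illustrative sketch, not the authors' split: the function name `tenfold_splits`, the fixed random seed, and the choice to shuffle individual sequences are assumptions, since the text does not say whether the groups were formed per sequence or per subject.

```python
import random


def tenfold_splits(n_items, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)   # assumed: sequences are shuffled once
    # Deal the shuffled indices into n_folds groups of (nearly) equal size
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# 480 image sequences in the dark-illumination subset
splits = list(tenfold_splits(480))
for fold_id, (train, test) in enumerate(splits):
    print(fold_id, len(train), len(test))
```

Each fold trains on 432 sequences and tests on the remaining 48, and every sequence is used for testing exactly once across the ten folds.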

Data Pre-Processing
In our experiment, a video sequence was pre-processed in the following three steps: (1) frame-by-frame face detection; (2) locating the eyes, nose, and mouth; and (3) cropping the eye, nose, and mouth areas. We found that step 2 had a significant effect on the performance of the network, so accurately locating these areas is crucial. To ensure that this was done accurately, the local areas were cropped based on the location of landmark points annotated by a robust landmark detector, discriminative response map fitting (DRMF) [41]. DRMF not only achieves good performance among landmark-detection methods [30], but also consumes very little computation time.
The local areas were cropped automatically; where the automatic crops were inaccurate, manual cropping was used instead. Using the facial landmark points annotated earlier, the three regions were identified by rectangular bounding boxes determined from the eye, nose, and mouth landmark points. We segmented the three local regions according to the following eleven points: E1 $(x_1, y_1)$, E2 $(x_2, y_2)$, E3 $(x_3, y_3)$, E4 $(x_4, y_4)$, E5 $(x_5, y_5)$, N1 $(x_6, y_6)$, N2 $(x_7, y_7)$, M1 $(x_8, y_8)$, M2 $(x_9, y_9)$, M3 $(x_{10}, y_{10})$, and M4 $(x_{11}, y_{11})$ (shown in Figure 3). The center point of the rectangular bounding box of the eye region is L1 = E5 $(x_5, y_5)$, and the length and width of the rectangle are $\frac{5}{3}|x_2 - x_1|$ and $\frac{4}{3}|y_4 - y_1|$, respectively. The center point of the rectangular bounding box of the nose region is L2 = ($x_5$,

For the network input, each video sequence was normalized to 32 frames using the linear interpolation method [42]. Each frame of the global face (whole face) and of the local areas was resized to 88 × 108 and 36 × 64, respectively. To reduce the amount of calculation, all input images were converted to 8-bit grayscale.
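Normalizing every sequence to 32 frames can be illustrated with a short NumPy routine. The sketch below assumes the simplest reading of linear interpolation, namely blending the two nearest source frames at evenly spaced positions along the time axis; the method of [42] may differ in detail, and the function name `normalize_length` and the synthetic clip are hypothetical.

```python
import numpy as np


def normalize_length(frames, target_len=32):
    """Resample a clip of shape (n_frames, height, width) to target_len frames."""
    frames = np.asarray(frames, dtype=np.float64)
    n = frames.shape[0]
    if n == 1:
        # A single frame can only be repeated
        return np.repeat(frames, target_len, axis=0)
    # Fractional source position of each output frame (first and last frames are kept)
    pos = np.linspace(0.0, n - 1, target_len)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    w = (pos - lo)[:, None, None]
    # Blend the two neighboring source frames
    return (1.0 - w) * frames[lo] + w * frames[hi]


# Synthetic 10-frame clip of 36 x 64 images whose pixel value equals the frame index
clip = np.arange(10, dtype=np.float64)[:, None, None] * np.ones((10, 36, 64))
out = normalize_length(clip)
print(out.shape)
```

The same routine shortens long clips and lengthens short ones, so all sequences reach the fixed temporal size expected by the network input.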

Comparisons of Different Streams and Their Fusion
Table 2 shows the average results of tenfold cross-validation for each local region using a single sub-network (one stream) and a fused network. The feature information of the eye (including eyebrows), nose, and mouth regions is extracted by a single stream, and the recognition rates are 35.37%, 42.76%, and 68.35%, respectively. The mouth region has the highest recognition rate, which may indicate that it is the most expressive part in the database. The recognition rate of the eye region is the lowest among the three regions. This may be due to some of the participants wearing glasses. In the NIR face image, the NIR light reflected by the glasses obscures the features of the eyes, so the frames with glasses have a great influence on recognition. At the same time, the recognition rate of the three-local-stream-fused network (TFNet) reaches 78.68%, which is much higher than that of each single-stream network (eye, 35.37%; nose, 42.76%; mouth, 68.35%). This indicates that the fusion is very effective in improving the recognition rate. After the network was fused, we added the SE block, which automatically allocates weights to different streams. Since the SE block lets the entire network adaptively learn the weight of each feature channel, the SETFNet further improves the recognition rate to 80.34%.
To investigate whether the SETFNet had extracted most of the expression features, we added one more stream to the SETFNet, which takes the frame of the global face as the input. Because each frame of the global face has a larger spatial size than that of each local area, we added one more convolution pair to this added stream. The network structure is shown in Figure 4, with the fourth stream being the global face stream. When it is added to the SETFNet, the recognition rate becomes 81.67%, compared with 80.34% for the SETFNet itself. That is to say, after adding the entire face as input, the improvement of the recognition rate is still limited. This may indicate that the SETFNet has extracted most of the expression features.
Table 2 also shows the time consumption of various single sub-networks and fused networks. The time for a single sub-network to process an image sequence is 0.515 s, and the time for TFNet and SETFNet to process a sequence is 1.158 and 1.237 s, respectively. Considering the large improvement in recognition rate made by the TFNet and SETFNet, the increase of computation time is acceptable. However, when a global face stream is added to the SETFNet, the time for the network to process a sequence is 2.142 s. The slight increase in recognition rate (80.34% versus 81.67%) made by the global stream comes at the expense of the processing time (1.237 s versus 2.142 s). Nevertheless, all of the computation times may be within acceptable limits, since the input is 32 frames. Under the hardware settings used (NVIDIA GeForce GTX 1080 GPU (8 GB) for deep-learning acceleration), the SETFNet can process 32/1.237 = 25.87 frames every second. The frame rate of a normal imaging system is 25-30 fps, and 25.87 fps is within this range, which means that the SETFNet can give the recognition result with a lag of only about 1 s in real-time imaging if the computation is performed in parallel with the imaging. With better hardware, the computation time can be further decreased to 1 s or less, which makes the processing a real-time process. Therefore, this network could be used in real applications.
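The throughput argument above is simple arithmetic and can be checked directly. The sketch below uses the processing times reported in Table 2 and the 25 fps capture rate of the database; the helper names `throughput_fps` and `keeps_up_with_camera` are hypothetical, and the comparison assumes that capture and computation run in parallel, as the text states.

```python
FRAMES_PER_SEQUENCE = 32   # each input sequence is normalized to 32 frames
CAMERA_FPS = 25            # capture rate reported for the database


def throughput_fps(seconds_per_sequence, frames=FRAMES_PER_SEQUENCE):
    """Frames processed per second when one sequence takes the given time."""
    return frames / seconds_per_sequence


def keeps_up_with_camera(seconds_per_sequence, frames=FRAMES_PER_SEQUENCE, fps=CAMERA_FPS):
    """True if a sequence is processed no slower than the camera captures it."""
    return seconds_per_sequence <= frames / fps


# Processing times per sequence in seconds, as reported in the text
reported_times_s = {
    "single sub-network": 0.515,
    "TFNet": 1.158,
    "SETFNet": 1.237,
    "SETFNet + global": 2.142,
}

for name, t in reported_times_s.items():
    print(f"{name}: {throughput_fps(t):.2f} frames/s, keeps up: {keeps_up_with_camera(t)}")
```

At 25 fps the camera needs 1.28 s to capture 32 frames, so the 1.237 s taken by SETFNet stays just inside the real-time budget, while the 2.142 s of SETFNet + global does not.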
The recognition rate of the eye region is the lowest among the three regions. One reason may be that the eyes have fewer features than the other parts; another reason could be that some of the subjects wear glasses. To verify the effect of glasses on the recognition rate, we input the eyes with and without glasses into the sub-network separately. The recognition results are shown in Table 3. It is seen that the recognition rate without glasses is better than that with glasses, which indicates that the glasses remove some features of the eyes. Since we divided the dataset into two parts, the recognition rates of wearing glasses and not wearing glasses are lower than that of the single sub-network with all data as the input.

Comparison of Embedded SE Block
The SE block was added to the network after the fusion so that it could receive the information of the entire network and have a global receptive field. In the SE block, the reduction ratio r is an important parameter that changes the capacity and computational cost. We compared different reduction ratios r in our network model, and the results are shown in Table 4. When r = 16, the accuracy is the highest; therefore, r is set to 16.

Comparisons with Other Methods
Table 5 shows the expression recognition rates of different methods on the Oulu-CASIA NIR facial expression database under dark-lighting conditions. For all of the methods, we used tenfold cross-validation to obtain an average recognition rate. The results of the deep temporal appearance-geometry network (DTAGN), the 3D CNN with deformable facial action parts (3D CNN DAP), and NIRExpNet were obtained from [37], and the result of LBP-TOP was obtained by implementing the algorithm in MATLAB (MathWorks, Natick, MA, USA). SETFNet and SETFNet + global were implemented using Caffe. It is seen that LBP-TOP and 3D CNN DAP achieve recognition rates of 69.32% and 72.12%, respectively, which are higher than that of DTAGN. NIRExpNet used the fused information of local and global features, and therefore achieves an even higher recognition rate than LBP-TOP and 3D CNN DAP. SETFNet uses only local information from three regions, but it achieves a higher recognition rate still (higher even than NIRExpNet, which uses local and global features). When a global face stream is added to SETFNet, the recognition rate further improves to 81.67%. This indicates that the automatic allocation of feature-channel weights helps improve the recognition performance and could be a promising method for NIR facial expression recognition.

Confusion Matrixes
To analyze the experimental results further, the confusion matrixes of SETFNet and SETFNet + global are shown in Tables 6 and 7, respectively. The labels on the left-hand side represent actual classes and those at the bottom represent the predicted classes; each percentage value in the matrix was calculated by dividing the number of samples predicted as a class by the number of samples in the corresponding actual class. After adding the global stream, the recognition rate of each expression is increased by 1-2%. It can be seen from Tables 6 and 7 that, whether or not the global face stream is added, both happiness and surprise have high recognition rates, while fear and disgust have relatively low rates. These low recognition rates may be due to the slight movement of AUs for fear and disgust, which makes it more difficult to distinguish them from other expressions. Moreover, disgust is confused with anger, fear, and sadness, and fear is confused with anger, disgust, happiness, and surprise, perhaps because their appearance and movements are similar to each other.
SETFNet + global takes the entire face as input. In general, more input features should increase the true prediction values (values on the diagonal of the confusion matrix) and decrease the false prediction values (zero values remain unchanged). Comparing Tables 6 and 7 shows that SETFNet + global does increase all true prediction values. However, more input does not always decrease the false prediction values. We can see from Table 7 that increased false prediction values do exist, which are indicated by up-pointing arrows. As the database is small in size, the prediction values could vary due to noise. To ensure that the located false prediction values increased only as a result of more input features, we located their paired false prediction values as well. Each false prediction value pair appears in the same color in Table 7; for example, 9.54% (fear predicted as anger) and 0% (anger predicted as fear) in green. Only when both paired values are increased can the two expressions be considered as more confused with each other in SETFNet + global.
Under this criterion, we can see that sadness is more often recognized as disgust (8.25% versus 3.52%) and disgust is more often recognized as sadness (4.08% versus 2.50%) when SETFNet + global is used. The reason for this might be that, in sadness and disgust expressions, the lower cheek areas have an up-and-down movement pattern due to the movement of AU15 or AU10 [44]. When SETFNet + global takes these similar movement patterns as input, sadness is recognized as disgust more often.
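The paired criterion described above, where two expressions count as more confused only if both directions of misclassification increase, can be expressed as a short check over two confusion matrixes. The sketch below is illustrative: the function name `mutually_increased_pairs` and the two 3 × 3 matrixes are made-up values for demonstration and are not taken from Tables 6 and 7.

```python
import numpy as np


def mutually_increased_pairs(before, after):
    """Return class pairs (i, j), i < j, whose two off-diagonal entries both increased.

    Rows are actual classes and columns are predicted classes.
    """
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    n = before.shape[0]
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Both "i predicted as j" and "j predicted as i" must grow
            if after[i, j] > before[i, j] and after[j, i] > before[j, i]:
                pairs.append((i, j))
    return pairs


# Hypothetical confusion matrixes in percent (rows sum to 100)
before = [[90, 5, 5],
          [2, 95, 3],
          [4, 6, 90]]
after = [[92, 6, 2],
         [3, 96, 1],
         [5, 3, 92]]
print(mutually_increased_pairs(before, after))  # [(0, 1)]
```

In this toy case the entry for class 2 predicted as class 0 also rises (4 to 5), but its paired entry falls (5 to 2), so that pair is treated as noise rather than as increased confusion.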
Tables 8-11 show the confusion matrix of the comparison algorithms, with the labels on the left-hand side representing actual classes and those at the bottom representing the predicted classes.The confusion matrix of NIRExpNet (Table 8) was adopted from [37] directly.The other matrixes were obtained by implementing the algorithms with MatLab code on the database (tenfold cross-validation).Happiness and surprise again have higher recognition rates than the others in all algorithms.Fear has the lowest average recognition rate, and disgust has a similar average recognition rate to that of anger and sadness.This trend is in accord with what SETFNet reveals.To further analyze the discrimination ability of different methods, we counted the number of zero false prediction values in each matrix.This number indicates that two corresponding expressions are perfectly recognized by the method.It is observed that NIRExpNet has 20 zero false prediction values, much more than other methods.3D CNN DAP, DTAGN, and LBP-TOP have a similar number of zero false prediction values (approximately 12).These results indicate that NIRExpNet has the best performance in distinguishing one expression from others.This could be because NIRExpNet is designed specifically for the dataset.The features extracted by NIRExpNet are balanced so the possibility of confusing one expression with others is small.Some zero false prediction values do not have zero paired values, e.g., the values in red in Table 9. 4.51% of the surprise expression was recognized as anger, but 0% anger was recognized as surprise using 3D CNN DAP.This could be due to the noise of the small dataset.
The F1 score and Matthews correlation coefficient (MCC) are calculated from the confusion matrices. Both indexes take the precision and recall of the classification results into account and are fairer measures for assessing a classifier. The F1 score and MCC are summarized in Table 12. SETFNet and SETFNet + global have the highest F1 and MCC, NIRExpNet has the second-highest values, and 3D CNN DAP the third highest. LBP-TOP and DTAGN have the lowest F1 and MCC. This indicates that SETFNet outperforms the other methods even under a more rigorous assessment. The ranking of the methods by F1 and MCC is in accord with their ranking by accuracy. This also indicates that the number of samples in each sub-category is well balanced.
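As a hedged illustration of how these two indexes may be derived from a confusion matrix, the sketch below computes a macro-averaged F1 score and the multiclass generalization of MCC from a matrix of counts. The text does not state which averaging scheme was used, so macro averaging is an assumption here, and the function names and toy matrix are hypothetical.

```python
import math


def macro_f1(cm):
    """Macro-averaged F1 from a count matrix (rows: actual, columns: predicted)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        actual = sum(cm[k])                        # samples that truly belong to class k
        predicted = sum(cm[i][k] for i in range(n))  # samples predicted as class k
        denom = actual + predicted
        # F1 = 2TP / (2TP + FP + FN) = 2TP / (actual + predicted)
        scores.append(2.0 * tp / denom if denom else 0.0)
    return sum(scores) / n


def multiclass_mcc(cm):
    """Multiclass Matthews correlation coefficient from a count matrix."""
    n = len(cm)
    t = [sum(cm[k]) for k in range(n)]                        # actual counts
    p = [sum(cm[i][k] for i in range(n)) for k in range(n)]   # predicted counts
    s = sum(t)                                                # total samples
    c = sum(cm[k][k] for k in range(n))                       # correct predictions
    num = c * s - sum(pk * tk for pk, tk in zip(p, t))
    den = math.sqrt((s * s - sum(pk * pk for pk in p)) *
                    (s * s - sum(tk * tk for tk in t)))
    return num / den if den else 0.0


# Made-up 3-class count matrix, not taken from the paper's tables.
toy_counts = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 9],
]
print(macro_f1(toy_counts), multiclass_mcc(toy_counts))
```

When the classes are balanced, as in this toy matrix, the ranking given by these indexes tends to follow the ranking given by accuracy, which is consistent with the observation above.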

Potential Application and Improvement
SETFNet, which uses three regions of the face as input, achieves a higher recognition rate than NIRExpNet, which uses the entire face as input, because an SE block can automatically allocate weights to the different streams. These results suggest that the automatic allocation of weights to different features helps improve the recognition rate. This idea of automatic allocation may have potential use in other recognition tasks: an SE block can be added after a feature fusion step to allocate weights to the fused features and further improve the recognition rate.
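The weight allocation performed by an SE block after feature fusion can be sketched in plain Python. This is a minimal sketch of the generic squeeze-and-excitation computation (global average pooling, a bottleneck of two fully connected layers, and channel-wise rescaling), not the exact layer configuration of SETFNet. The channel count, reduction ratio, and random weights are hypothetical, and biases are omitted.

```python
import math
import random


def se_block(features, w1, w2):
    """Apply a squeeze-and-excitation step to fused features.

    features: list of C channels, each a flat list of activations
              (spatio-temporal positions flattened).
    w1: (C // r) x C weight matrix of the first fully connected layer.
    w2: C x (C // r) weight matrix of the second fully connected layer.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(ch) / len(ch) for ch in features]
    # Excitation: FC + ReLU, then FC + sigmoid gives a weight in (0, 1) per channel.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    scale = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: reweight every channel by its weight.
    out = [[s * v for v in ch] for s, ch in zip(scale, features)]
    return out, scale


random.seed(0)
C, r, n_pos = 6, 2, 4  # hypothetical: e.g., 2 channels from each of 3 streams after concatenation
features = [[random.random() for _ in range(n_pos)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]  # untrained, random
w2 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]  # untrained, random
reweighted, channel_weights = se_block(features, w1, w2)
print(channel_weights)
```

In a trained network the two weight matrices are learned together with the rest of the model, so the per-channel weights reflect how useful each fused feature is for recognition.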
SETFNet + global has a slightly higher recognition rate than SETFNet, but requires much more computation time. This indicates that a small part of the face can carry most of the expression information. For other types of facial expression recognition tasks, it may be sufficient to analyze only the parts of the face that carry expression information, which can save considerable computation time and make real-time recognition feasible.
The highest recognition rate on the Oulu-CASIA NIR facial expression database (dark condition) is 98.6%, achieved by Rivera et al. [45], who proposed a directional number transitional graph (DNG) method. The confusion matrices achieved by the DNG method are summarized in Tables 13 and 14 (adopted directly from [45]), with the labels on the left-hand side representing actual classes and those at the bottom representing the predicted classes. Table 13 is the confusion matrix of DNG using 3D Sobel (DNG_S), and Table 14 is the confusion matrix of DNG using the nine-plane mask (DNG_P). The recognition rate of every expression class exceeds 97%, and the rates are similar to one another. This may indicate that DNG obtains features good enough to discriminate one expression from the others. In terms of zero false prediction values, DNG_S has 21 and DNG_P has 23, which are more than those of all the other methods. This indicates that the DNG method achieves the least confused matrices. The F1 and MCC of DNG are also higher than those of the other methods (DNG_S: F1 0.9859, MCC 0.9830; DNG_P: F1 0.9879, MCC 0.9856). This indicates that DNG outperforms the other methods under a more rigorous assessment. DNG consists of purpose-designed feature-extraction and feature-fusion methods, which make the extracted features robust in uneven illumination conditions. This could be the reason why DNG achieves the best performance. Based on the design of DNG, two aspects could be considered in the future design of SETFNet. Firstly, the uneven illumination conditions in the database could be taken into account when designing the network, for example by using the features extracted by DNG as an additional stream of the network. Secondly, a more sophisticated fusion method could be used, e.g., the concatenation operation used in this paper could be replaced by the fusion method in DNG.
However, unlike DNG, which uses hand-crafted features, the SETFNet proposed in this paper extracts features automatically. This design does not require background knowledge of the data. Specifically, the feature extraction in this paper was performed by a 3D CNN. Since the dataset used for training the CNN is small, the proposed network is not deep and may not extract high-level features. To further improve the recognition rate, transfer learning could be used, i.e., training a deeper CNN on a larger dataset and then fine-tuning the network on the NIR database.

Conclusions
In this paper, we proposed a three-stream 3D CNN architecture with an SE block called SETFNet that can automatically learn spatio-temporal features simultaneously. We only used three local regions of the face as input to the network. The advantages of using local information as input to the network were the removal of some information unrelated to recognition and a reduction of the amount of computation. To enable the network to adaptively learn the weight of each feature channel, an SE block was added to the network after the fusion of three single sub-networks. Experimental results show that SETFNet can achieve an average recognition rate of 80.34%; when a global face stream was added to SETFNet, the recognition rate was further increased to 81.67%, which is higher than some state-of-the-art methods.

Figure 2. Overall structure of the proposed SE three-stream fusion network (SETFNet). The SE block is displayed in the dotted box.


$\frac{y_7 - y_6}{2}$), and the length and width of the rectangle are $|y_7 - y_6|$ and $|x_3 - x_4|$, respectively. The center point of the rectangular bounding box of the mouth region is $L_3 = (x_5, \frac{y_{11} - y_9}{2})$, and the length and width of the rectangle are $\frac{5}{3}|x_{10} - x_8|$ and $\frac{4}{3}|y_{11} - y_9|$, respectively.

Electronics 2019, 8, x FOR PEER REVIEW 7 of 16

Figure 3. Positions of 11 points for segmenting three regions.


Figure 4. Structure of SETFNet plus global face stream.


Table 1. Configuration of each stream.

Table 2. Comparison of different local and fused networks.

Table 3. Comparison of recognition rate with and without glasses.


Table 4. Comparison of different network reduction ratios.

Table 5. Comparison of total recognition rates of different methods.

Table 6. Confusion matrix of SETFNet. Labels on left-hand side represent actual classes; those on bottom represent predicted classes.

Table 7. Confusion matrix of SETFNet + global. Labels on left-hand side represent actual classes; those on bottom represent predicted classes.

Table 12. Comparison of F1 score and MCC of different methods.

Table 13. Confusion matrices of DNG_S.

Table 14. Confusion matrices of DNG_P.