In the experiments, we classify several databases of seven emotions: happiness, sadness, anger, surprise, disgust, fear, and neutral/contempt. The Japanese Female Facial Expression (JAFFE) database [17] contains changes in the facial expressions of Japanese women; a total of 192 training images of 10 actors are provided. The Karolinska Directed Emotional Faces (KDEF) database [18] covers the seven emotions with 967 training images, consisting of photographs of 35 male and 35 female actors, for a total of 70 actors. MUG is a seven-expression facial database provided by the Multimedia Understanding Group [19]. It provides photographs of 51 male and 35 female actors, for a total of 86 actors, with ages distributed between 20 and 35 years; the set contains 1746 images in total. The Warsaw Set of Emotional Facial Expression Pictures (WSEFEP) [20] provides frontal face images of 14 male and 16 female actors. The Extended Cohn–Kanade Dataset (CK+) [21] includes the frame-by-frame changes of the actor expressions over 593 sequences; 123 actors were each filmed in 1 to 14 sub-sequences. To optimize the Grid Map using reinforcement learning, the optimal classifier and feature extraction method for assigning compensation values must be selected by comparison on the training databases.
Figure 9 shows an image from the CK+ database. Because the actual face size of the model differs between sequences, the distribution of facial positions in the images also differs. Thus, the region of each image containing the actor's face was isolated for training, excluding the background. This database normalization was accomplished by cropping the facial region using a face detector with the same cascade and then resizing the output image to a predetermined size.
The database normalization shown in Figure 10 uses a face detector based on the method of Jones [22] with a MATLAB default facial cascade. Through this process, the face-detection stage that would otherwise be required at each training iteration can be omitted. Because the same cascade was used, the databases had a consistent form. This preprocessing was also applied to JAFFE, KDEF, MUG, and WSEFEP.
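The crop-and-resize normalization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bounding box is assumed to come from a Viola–Jones-style cascade detector, and nearest-neighbour resampling stands in for the library's resize routine.

```python
import numpy as np

def normalize_face(image, bbox, out_size=100):
    """Crop the detected face region and resize it to a fixed square size.

    bbox is (x, y, w, h) as produced by a cascade face detector
    (hypothetical interface); nearest-neighbour resampling is used
    here for self-containment.
    """
    x, y, w, h = bbox
    face = image[y:y + h, x:x + w]
    # Map each output pixel back to a source pixel (nearest neighbour).
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return face[rows][:, cols]

img = np.arange(200 * 200).reshape(200, 200)
norm = normalize_face(img, (40, 30, 120, 120), out_size=100)
print(norm.shape)  # (100, 100)
```

Because every database passes through the same detector and the same output size, the resulting training images share a consistent form regardless of their original resolution.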
4.1. Validation of Classifiers
To select the main classifier and features to be used in the experiment, we compared accuracy and searched for the optimal cell size using the four feature-extraction/classification combinations of HOG and Local Binary Pattern (LBP) features with ECOC and k-Nearest Neighbor (kNN) classifiers.
Table 1 shows the optimal cell size, maximum classification accuracy, feature extraction time, and classification time for 78 test images, each sized 100 × 100 pixels, randomly selected from five databases. For training, we used a single thread of an Intel Core i7-8750H 2.3 GHz processor.
As shown in Table 1, the HOG-ECOC method exhibits the maximum accuracy and classification speed. The optimal cell size was found to be 10. Notably, smaller cell sizes result in a larger number of extracted feature points. Each of the four methods reached its highest accuracy at a specific feature count even though more features could have been used; the influence of facial elements that are important for expression recognition is apparently weakened by the less informative ones. To evaluate and select the optimal method among the OVO, OVA, and ordinal BTS coding schemes of the ECOC classifier, classification accuracy was measured using 3105 training images from JAFFE, KDEF, MUG, and WSEFEP.
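The relation between cell size and feature count can be made concrete with a rough HOG descriptor-length estimate. This is a simplified sketch: it counts one 9-bin orientation histogram per cell and omits block normalization, which multiplies the count further in real HOG implementations.

```python
def hog_feature_count(img_size=100, cell=10, orientations=9):
    """Approximate HOG descriptor length for a square image:
    one orientation histogram per cell (block overlap ignored)."""
    cells_per_side = img_size // cell
    return cells_per_side * cells_per_side * orientations

for cell in (20, 10, 8, 4):
    print(cell, hog_feature_count(cell=cell))
# cell 20 -> 225, cell 10 -> 900, cell 8 -> 1296, cell 4 -> 5625
```

Halving the cell size roughly quadruples the feature count, which is why the dimensionality grows so quickly below the optimal cell size of 10.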
As shown in Table 2, the OVO method resulted in the highest accuracy and the lowest classification time.
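The coding schemes compared in Table 2 differ in how many binary learners they train. A small sketch of the learner counts for the seven-emotion problem (the BTS scheme is omitted because its tree structure is not specified in the text):

```python
from math import comb

def ecoc_learners(n_classes, scheme):
    """Number of binary classifiers trained by each ECOC coding scheme."""
    if scheme == "OVO":   # one-vs-one: one learner per pair of classes
        return comb(n_classes, 2)
    if scheme == "OVA":   # one-vs-all: one learner per class
        return n_classes
    raise ValueError(scheme)

print(ecoc_learners(7, "OVO"), ecoc_learners(7, "OVA"))  # 21 7
```

OVO trains more, but simpler, binary problems (21 pairwise learners for seven classes versus 7 for OVA); each pairwise learner sees only two classes' data, which can keep individual classification fast.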
For the use of Q-learning with the Grid Map, the learning rate (α) must be set as an additional parameter. Training should be intentionally stopped when the classification accuracy no longer improves across transitions. Accordingly, α is determined by the variance of the classification accuracies over the history of previous Grid Maps included in the policy. This can be written as
where c is a constant that adjusts the influence of the variance value and normalizes α to the range between 0 and 1 during training. Its optimal value was determined experimentally through repeated trials.
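One plausible reading of this rule can be sketched as follows. The exact functional form of the equation is not reproduced here, so this sketch is an assumption: α is taken proportional to the variance of the recorded accuracies, scaled by c and clipped to [0, 1].

```python
import statistics

def learning_rate(accuracy_history, c=0.5):
    """Hypothetical sketch: alpha grows with the variance of the
    Grid Map accuracy history, scaled by the constant c and clipped
    to [0, 1]. The precise formula in the paper may differ."""
    if len(accuracy_history) < 2:
        return 1.0  # assumed default before any history exists
    var = statistics.variance(accuracy_history)
    return max(0.0, min(1.0, c * var))

# As accuracies stop changing, the variance (and hence alpha) shrinks,
# which effectively winds training down.
print(learning_rate([0.90, 0.91, 0.95]))
print(learning_rate([0.95, 0.95, 0.95]))  # 0.0
```

The intent matches the text: when the accuracy history flattens out, α approaches zero and training stops contributing updates.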
4.2. Classification Accuracy
The Grid Map was trained using 10-fold cross-validation with two datasets: one combining JAFFE, KDEF, MUG, and WSEFEP, and the other CK+ alone, used for comparison with other methods. CK+ was kept separate because its emotion class composition differed from the other four. The optimized Grid Map obtained through CK+ is shown in Figure 11.
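The 10-fold split used here can be illustrated with a minimal index-level sketch (a simplified stand-in for a library cross-validation utility; it assumes a plain round-robin assignment of samples to folds):

```python
def kfold_indices(n, k=10):
    """Split n sample indices into k folds; each fold serves once
    as the validation set while the rest are used for training."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

# 3105 images from the merged database, as in the experiments.
splits = list(kfold_indices(3105, 10))
print(len(splits))  # 10
```

Every image appears in exactly one validation fold, so the reported accuracy averages over ten train/validation partitions of the same data.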
In the case of Figure 11, a total of 23 grid splits occurred, and the classification accuracy was 98.4709% on the CK+ database. Many small grids can be seen around the eyes, nose, and mouth. The following experiments present the environment settings needed to achieve these results and explain how the form in Figure 11 was derived.
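The relation between split count and grid count follows directly if the Grid Map is assumed to refine quadtree-style, each split replacing one grid with four:

```python
def grid_count(splits):
    """Assumed quadtree-style refinement: each split replaces one grid
    with four, a net gain of three grids per split."""
    return 1 + 3 * splits

print(grid_count(23))  # 70
```

Under this assumption, the 23 splits of Figure 11 yield 70 grids, most of the fine ones concentrated on the expressive facial regions.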
The first experiment examines how padding affects classification accuracy.
Figure 12 is a chart comparing the effect of padding (w padding) and no padding (w/o padding) on the improvement of Grid Map classification accuracy at each transition during training. It illustrates that w/o padding shows higher accuracy in early states, whereas w padding shows higher accuracy in later states. The pattern of the chart changes at the 20th transition, which is interpreted as the difference in information loss between the original image and the normalized image at each grid size during sub-image normalization. As the number of transitions increases, the proportion of smaller grids increases and this loss decreases, so the accuracy curves cross at the 20th transition. Because the optimal Grid Map generally appeared after the 20th transition in the iterative verification process, it can be concluded that padding is advantageous for accuracy improvement.
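The padded crop can be sketched as follows. This is an illustrative reading of "w padding", not the authors' code: each grid cell is cropped together with a margin of surrounding context (here a hypothetical `pad` parameter, clipped at the image border) before it is normalized.

```python
import numpy as np

def crop_grid_cell(image, r0, r1, c0, c1, pad=0):
    """Crop one grid cell, optionally including `pad` pixels of
    surrounding context so normalization loses less border detail."""
    h, w = image.shape
    r0, c0 = max(0, r0 - pad), max(0, c0 - pad)
    r1, c1 = min(h, r1 + pad), min(w, c1 + pad)
    return image[r0:r1, c0:c1]

img = np.zeros((128, 128))
print(crop_grid_cell(img, 32, 64, 32, 64).shape)         # (32, 32)
print(crop_grid_cell(img, 32, 64, 32, 64, pad=4).shape)  # (40, 40)
```

The margin matters most for small grids, whose border pixels make up a larger fraction of the crop, which is consistent with padding paying off in the later, finer-grained transitions.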
In this experiment, 3105 images were trained using a merged database to optimize the Grid Map. The variables for the minimum grid size were set at 3 and 4.
The training image normalization resolution is 128 × 128, and the max depth is 4, because the minimum image size required for HOG feature extraction using 2 × 2 cells is 4 × 4.
Figure 13 illustrates that the accuracy of the optimal Grid Map improves as the max depth increases. To increase the max depth further, the normalization resolution of the training images would have to be increased.
Training using CK+ proceeded to the same optimal condition selected in previous experiments.
Figure 14 shows the classification accuracy as a function of the number of transitions; the optimal Grid Map measured here exhibited lower classification accuracy. Based on this result, it was concluded that the max depth should be increased.
During adaptive feature extraction, training images were cropped and normalized to the same size before being delivered to the feature extractor, even when their grids had different depths. As a result of this normalization, information in low-depth grids was degraded. To address this potential problem, we compared classification accuracy across increased normalization resolutions.
Figure 15 shows the trend of classification accuracy improvement for Grid Maps trained at three normalization resolutions; increasing the resolution improves accuracy, and the optimal Grid Map with the highest classification accuracy resulted from the 512 × 512 normalization resolution. A normalization resolution of 1024 × 1024 was also tested, but its classification accuracy decreased to 93.09%, so it was not included in the chart.
Figure 16 shows the optimal Grid Map for each resolution in Figure 15, and Table 3 shows how the grids are distributed by depth at each normalization resolution. Many high-depth grids are concentrated around the eyes, nose, and mouth, which show facial expressions clearly, whereas low-depth grids are distributed over the forehead, chin, and ears, which do not.
As shown in Table 3, higher normalization resolutions result in a greater number of high-depth grids. This is interpreted as follows: the improved resolution allows more information to be obtained from the high-depth grids, which improves overall classification accuracy.
4.3. Result of Feature Reduction
In addition to improving classification accuracy, FER using Grid Map also has the advantage of reducing the number of features required for the same accuracy.
Figure 17 illustrates the cell distribution of two methods with similar classification accuracy. The left image shows the distribution of the HOG-ECOC classifier without adaptive feature extraction, and the right image shows the distribution of Grid Map.
In Figure 17, the Grid Map shows a more efficient cell distribution and achieves 0.39% higher classification accuracy even with fewer bins.
Table 4 shows the result of 1000 repeated experiments under the conditions used to obtain Figure 17, each classifying one image randomly selected from the 3105 images of the merged database.
Table 4 presents the computational costs of the basic and proposed methods. The total time (s) and the time taken to classify one image (ms) cover all computations for the classifications, performed using a single thread of an Intel Core i7-8750H 2.3 GHz processor. Even though the proposed method involves more processing steps, it incurs a lower computational cost. This Grid Map is the optimal state for the merged database. The classification accuracy according to bin number is compared in
Table 5.
In Table 5, because the two algorithms do not produce exactly matching bin counts, rows are paired by similar numbers of bins. When the cell size is smaller than 10 × 10, the number of bins in the basic method increases exponentially, and the accuracy improvement diminishes beyond a cell size of 8 × 8. According to these results, the Grid Map can classify facial expressions in 66.33% of the time required by the basic method under certain conditions.
The proposed method was able to classify facial expressions more accurately with fewer features. In the adaptive feature extraction process, applying padding and increasing the normalization resolution to 512 × 512 improved classification accuracy by optimizing the Grid Map. For comparison, we collected experimental results on the CK+ database from other papers on modified FER, as listed in
Table 6.
Table 6 shows that the proposed method achieves the highest classification accuracy when classifying seven classes. The six-class accuracy of the proposed method was omitted, but it would be expected to be higher, following the trend in existing results of higher accuracy for six classes than for seven. (DCNN: Deep Convolutional Neural Network; BDBN: Boosted Deep Belief Network; AUDN: Action Unit-inspired Deep Network.)