Facial Expression Recognition Based on DWT Feature for Deep CNN

Facial expression recognition has become one of the most important fields of research in pattern recognition. In this paper, we propose a method to recognize people's emotions from their facial expressions. The method combines the Viola-Jones face detection algorithm, facial image enhancement using histogram equalization, the discrete wavelet transform (DWT), and a deep convolutional neural network (CNN). The facial features extracted by the DWT are used directly as input to train the CNN. Our experiments were performed on the CK+ and JAFFE face databases, on which the network achieved recognition accuracies of 96.46% and 98.43%, respectively.


Introduction
Facial recognition is currently one of the most important biometric identification technologies; it has many advantages, such as low cost and high reliability. Facial recognition has been used in several areas such as pattern recognition, computer vision, security, and cognitive science.
In recent years, facial expression recognition (FER) techniques have attracted growing interest from the scientific community [1,2]. Facial expressions are an effective channel for human-computer interaction and for non-verbal interpersonal communication. FER has applications in various fields such as security and surveillance, artificial intelligence, military and police services, and psychology, among others.
Facial expressions are classified into six basic categories, namely anger, disgust, fear, sadness, happiness, and surprise; a neutral expression was also added to this group.
Facial expression recognition goes through three main steps. The first is face detection in the image; its effectiveness has a direct influence on the performance of the FER system. The second is facial feature extraction, and the third and last step is expression classification.
Many methods have been used to improve the recognition rate and speed of facial expression recognition, but several challenges remain, such as variations in the person's pose and changes in illumination.
Alternative methods are based on transforms such as the Fourier transform (FT), the short-time FT (ST-FT), and the discrete wavelet transform (DWT) [9]. Feature extraction based on the DWT is very useful for FER and has a very low computational cost, which makes it an ideal tool for image processing and computer vision.
The main contributions of the proposed methodology are as follows: first, to develop a robust feature extraction approach; second, to improve the performance of the FER system and obtain a high recognition rate.
In this paper, we propose a model that applies the Viola-Jones face detection algorithm to detect faces and to separate them from the remaining, non-face parts of the image. We then evaluate different image enhancement algorithms for improving image contrast; each algorithm is judged against the others using two parameters, the absolute mean brightness error (AMBE) and the peak signal to noise ratio (PSNR). Moreover, we apply the discrete wavelet transform (DWT) to the face images to extract features.
Finally, classification is performed by deep learning with a convolutional neural network (CNN). Convolutional neural networks are a type of artificial neural network that has been used in several areas such as classification and decision-making.
The rest of the paper is organized as follows. In Section 2, we briefly review the related work in this field; in Section 3, we describe the four steps of our facial expression recognition system; we present our experimental results in Section 4; and finally, Section 5 concludes our work.

Related Work
In recent years, the scientific community has shown an increasing interest in the domain of facial expressions; researchers have used several techniques to obtain a better representation of facial expressions, such as principal component analysis (PCA) and local binary patterns (LBP).
This section describes recent research on facial expression recognition (FER) using CNNs that achieves a high degree of accuracy.
In 2014, Liu et al. [10] used a 3D-CNN for facial expression recognition, together with a deformable facial action part model to localize the action parts of the face.
In 2015, Peter Burkert et al. [11] used a CNN for facial expression recognition; the feature extraction proposed in that work is independent of any hand-crafted features. Dennis H. et al. [12] applied two CNN channels to the facial image; the information from the two channels is combined to achieve 94.4% recognition accuracy.
In 2016, Cui, R. et al. [13] proposed a CNN-based approach that uses the set of outputs of three CNNs for classification.
In 2017, Nwosu, L. et al. [14] proposed an approach consisting of two steps: the Viola-Jones method for the detection of facial parts and a deep CNN for feature extraction and classification. This method achieves 97.71% and 95.72% recognition accuracy on the JAFFE and CK+ datasets, respectively.
In 2018, Yang, B. et al. [15] proposed an FER method consisting of three steps: Viola-Jones face detection, local binary patterns (LBP) feature extraction, and a weighted mixture deep neural network (WMDNN) based on double-channel facial images; the recognition accuracy for JAFFE and CK+

Face Detection Using the Viola-Jones Algorithm
The effectiveness of biometric systems based on face authentication depends essentially on the method used to locate the face in the image. In our method, we use the Viola-Jones algorithm to detect various parts of the human face, such as the mouth, eyes, nose, nostrils, eyebrows, lips, and ears [16]. Although several researchers have worked on algorithms to detect the human face and its parts, the most effective one was proposed by Paul Viola and Michael Jones in 2001. This algorithm is implemented in MATLAB as vision.CascadeObjectDetector. Viola-Jones relies on three important techniques for the detection of facial parts:
1. Haar-like features are rectangular digital image features used in object recognition.
2. AdaBoost is a machine learning method; the term 'boosted' refers to a principle that brings together many algorithms relying on sets of binary classifiers [17].
3. The cascade classifier is the third and last step; it efficiently combines many features and chains several filters into a resultant classifier.
An example of the Viola-Jones algorithm is shown in Figure 2.
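To make the first technique concrete, the following is a minimal sketch in plain Python of how a two-rectangle Haar-like feature can be evaluated. It assumes the integral image (summed-area table) that the Viola-Jones detector uses to obtain rectangle sums in constant time; the function names and the toy patch are illustrative only and are not taken from the MATLAB implementation used in this work.

```python
def integral_image(img):
    """Return the summed-area table with one extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_haar(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: left half minus right half (even width)."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum


# Hypothetical 4x4 patch: dark left half, bright right half.
patch = [[10, 10, 200, 200]] * 4
ii = integral_image(patch)
feature = two_rect_haar(ii, 0, 0, 4, 4)
print(feature)  # strongly negative value signals a dark-to-bright vertical edge
```

A large magnitude of the feature indicates a strong intensity contrast between the two halves of the window, which is the kind of cue AdaBoost can select when building the cascade.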

In the case of a color image, the input image must first be converted to grayscale. In what follows, we detail each step of the FER system; the proposed system is shown in Figure 1.

Enhancement Techniques
In this work, we consider several image enhancement techniques, which are described below.

Histogram Equalization
Histogram equalization is a method of adjusting the contrast of a digital image. It consists of applying a transform on each pixel of the image, and hence obtaining a new image from an independent operation on each of the pixels. This transform is constructed from the accumulated histogram of the original image [18].
Histogram equalization distributes the intensities more evenly over the entire range of possible values by "spreading" the histogram. It is useful for images that are of low contrast in whole or in part (that is, whose pixels have similar intensities). The method is fast, easy to implement, and fully automatic.
Let $X$ be the input image with $L$ discrete gray levels, so that its intensity values can be regarded as random variables taking values in $[0, L-1]$. $X(i, j)$ denotes the intensity of the image at spatial location $(i, j)$ and satisfies $X(i, j) \in \{X_0, X_1, \ldots, X_{L-1}\}$. The histogram $h$ of the digital image is the discrete function given by (1):
$$h(X_k) = n_k, \quad k = 0, 1, \ldots, L-1 \tag{1}$$
where $n_k$ is the number of pixels in the input image with gray level $X_k$.
The probability density function (PDF) is defined by (2):
$$p(X_k) = \frac{n_k}{M \times N} \tag{2}$$
where $M \times N$ is the size of the image $X$. The cumulative distribution function (CDF) is obtained by (3):
$$c(X_k) = \sum_{s=0}^{k} p(X_s) \tag{3}$$
HE is achieved with a transform function $T(X_k)$ defined from the CDF of the gray-level PDF of the given image, as shown in (4):
$$T(X_k) = (L-1)\, c(X_k) \tag{4}$$
This method successfully increases the global contrast of images, but it has several shortcomings: some details in the image are lost, some local areas become brighter than before, and the brightness of the image is not preserved.
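The steps above (histogram, PDF, CDF, transform) can be sketched in a few lines of plain Python. This is an illustrative sketch only, assuming an 8-bit image stored as a list of rows and the common scaling of the CDF by L − 1; it is not the MATLAB routine used in the experiments.

```python
def equalize_histogram(img, levels=256):
    """Global histogram equalization of a 2D list of integer gray levels."""
    n_pixels = len(img) * len(img[0])

    # Histogram: number of pixels n_k at each gray level.
    hist = [0] * levels
    for row in img:
        for value in row:
            hist[value] += 1

    # CDF: running sum of the PDF n_k / (M * N).
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running / n_pixels)

    # Transform: scale the CDF to the output range [0, L - 1].
    lut = [round((levels - 1) * c) for c in cdf]
    return [[lut[value] for value in row] for row in img]


# Hypothetical low-contrast patch whose levels are spread over the full range.
low_contrast = [[100, 101], [102, 103]]
equalized = equalize_histogram(low_contrast)
print(equalized)
```

Four nearly identical gray levels are mapped to values spread across the whole range, which illustrates both the contrast gain and the brightness shift mentioned above.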

Adaptive Histogram Equalization
To avoid the drawbacks of the histogram equalization method discussed above, a modification of HE called adaptive histogram equalization (AHE) can be used on such images for better results. In AHE, the input image is divided into small blocks called "tiles". Histogram equalization (HE) is then applied to each of these tiles using its own CDF. AHE is, therefore, a local operation that can simultaneously enhance all regions occupying different grayscale ranges and sharpen the definition of edges in each region of an image [19]. However, the AHE method has drawbacks, such as over-amplifying the noise in relatively homogeneous regions and a very high computational cost.

Contrast Limited Adaptive Histogram Equalization
Contrast limited adaptive histogram equalization (CLAHE) is a variant of adaptive histogram equalization (AHE). This method limits contrast amplification by clipping the histogram at a predefined value before computing the CDF, so as to overcome the problem of noise. The value at which the histogram is clipped, called the clip limit, depends on the normalization of the histogram and thereby on the size of the neighboring region [20].
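The clipping step can be illustrated with a short sketch in plain Python. It assumes an absolute clip limit expressed as a pixel count and a single pass that spreads the clipped excess uniformly over all bins; the normalized clip limit used in MATLAB and the exact redistribution scheme of a given CLAHE implementation may differ.

```python
def clip_histogram(hist, clip_limit):
    """Clip each bin at clip_limit and spread the excess evenly over all bins."""
    excess = sum(max(0, count - clip_limit) for count in hist)
    clipped = [min(count, clip_limit) for count in hist]
    share, remainder = divmod(excess, len(hist))
    # Every bin receives an equal share; the first `remainder` bins get one more.
    return [c + share + (1 if i < remainder else 0) for i, c in enumerate(clipped)]


# Hypothetical tile histogram with one dominant gray level.
tile_hist = [0, 2, 30, 4, 0, 0, 0, 0]
limited = clip_histogram(tile_hist, clip_limit=10)
print(limited)
```

The total pixel count is preserved, but the dominant bin is flattened, so the CDF computed afterwards has a gentler slope and amplifies noise less. Note that with a single redistribution pass a bin can end slightly above the limit.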

Extraction of Facial Features by Discrete Wavelet Transform (DWT)
The extraction of features such as the eyes, nose, and mouth is a necessary pre-processing step for facial expression recognition. In this step, we applied the discrete wavelet transform.
The wavelet is a well-known tool in image processing and computer vision, with applications such as compression, detection, and recognition. The discrete wavelet transform (DWT) can localize a signal in both time and frequency at the same time, and it is considered a new generation of the discrete Fourier transform (DFT) [21].
The DWT decomposes the signal into several frequency bands using two filters known as the 'wavelet filter' and the 'scaling filter', which act as a high-pass filter and a low-pass filter, respectively. The DWT can be performed with different mother wavelets, such as Haar, Symlet, and Daubechies.
In image processing, the 2D-DWT first operates along the rows of the original image, applying the low-pass filter (LPF) and the high-pass filter (HPF) simultaneously [22]. The result is then down-sampled by a factor of 2, yielding a detail part (high frequency) and an approximation part (low frequency).
The same operation is then performed along the image columns. Four sub-bands are generated at each decomposition level: an 'approximation' sub-band (LL) and three 'detail' sub-bands, namely vertical (LH), horizontal (HL), and diagonal (HH) (see Figure 3). We used the Symlet wavelet as the mother wavelet in our approach [23].
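The row-then-column decomposition can be sketched as follows in plain Python. For brevity the sketch uses the Haar wavelet rather than the Symlet wavelet adopted in this work, since its filters have only two taps; the sub-band variables are named after the order in which the filters are applied (rows first, then columns), which is an assumption of this example.

```python
import math


def haar_1d(signal):
    """One Haar analysis step: (approximation, detail), each half the length."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def filter_columns(matrix):
    """Apply haar_1d down every column; return the (low, high) images."""
    pairs = [haar_1d(col) for col in transpose(matrix)]
    low = transpose([p[0] for p in pairs])
    high = transpose([p[1] for p in pairs])
    return low, high


def haar_dwt2(img):
    """Single-level 2D Haar DWT of an image with even height and width."""
    rows = [haar_1d(row) for row in img]
    row_low = [r[0] for r in rows]    # low-pass filtered along the rows
    row_high = [r[1] for r in rows]   # high-pass filtered along the rows
    ll, lh = filter_columns(row_low)
    hl, hh = filter_columns(row_high)
    return ll, lh, hl, hh


# Hypothetical 4x4 image made of four constant 2x2 blocks.
image = [[1, 1, 5, 5],
         [1, 1, 5, 5],
         [3, 3, 7, 7],
         [3, 3, 7, 7]]
ll, lh, hl, hh = haar_dwt2(image)
print(ll)  # half-size approximation; the detail sub-bands are zero here
```

Each sub-band has half the height and width of the input, so using the approximation sub-band as the CNN input reduces the amount of data by a factor of four per decomposition level.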

Classification Using Deep Convolutional Neural Networks
Convolutional neural networks are deep artificial neural networks, primarily used to classify images and group them based on similarity. CNNs can identify faces, characters, human poses, tumors, street signs, and so on [24].
Feature extraction of the local texture of the human face was performed with the discrete wavelet transform. The result is the input to the deep convolutional neural network.
In this paper, as shown in Figure 4, we propose a network structure that contains three convolutional layers, two pooling layers, and one fully connected layer.

Convolutional Layers
The convolutional layer (ConvL) is one of the most important components of a CNN. A CNN comprises one or more ConvLs, which are the core building blocks of the network and do most of the computational heavy lifting [25].
As in a traditional neural network, the input of each ConvL is the output of the previous layer. Each feature map in the ConvL corresponds to a convolution kernel of the same size and is obtained by convolving that kernel with the feature maps of the previous layer [26]; a bias is then added, and the result is passed through the activation function.
The convolution kernel size of the first ConvL, C1, is 5 × 5, and that of the following layers, C2 and C3, is 3 × 3. The latter two layers use 3 × 3 kernels because two stacked 3 × 3 convolutions increase the network's non-linear capacity, making the decision function more discriminative. However, if the first layer also used 3 × 3 kernels, the model would have too few parameters, which would decrease performance.
The mathematical expression of the layer [27] is as follows:
$$x_j^{l} = f\left( \sum_{i \in M_j} x_i^{l-1} * k_{ij}^{l} + b_j^{l} \right)$$
where $l$ represents the layer, $f$ represents the activation function, $k$ is the convolution kernel, $b$ is the bias, and $M_j$ represents the set of input feature maps.
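As an illustration of this layer expression, the sketch below computes one output feature map in plain Python. It assumes a 'valid' sliding-window product without kernel flipping (cross-correlation, as most CNN libraries implement it), ReLU as the activation f, and hypothetical toy inputs; it is not the MATLAB network used in the experiments.

```python
def conv2d_valid(x, k):
    """'Valid' 2D sliding-window product of input map x with kernel k."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[r + i][c + j] * k[i][j] for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(value):
    return max(0.0, value)


def conv_layer_output(inputs, kernels, bias):
    """One output map: f(sum over input maps of (x_i * k_i) + b), with f = ReLU."""
    maps = [conv2d_valid(x, k) for x, k in zip(inputs, kernels)]
    h, w = len(maps[0]), len(maps[0][0])
    return [[relu(sum(m[r][c] for m in maps) + bias) for c in range(w)]
            for r in range(h)]


# Hypothetical 4x4 input map and a 3x3 kernel that responds to horizontal gradients.
x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
k = [[-1, 0, 1]] * 3
out = conv_layer_output([x], [k], bias=-2.0)
print(out)
```

A 3 × 3 kernel on a 4 × 4 input yields a 2 × 2 output, which shows how 'valid' convolutions shrink the feature maps from one layer to the next.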

Pooling
The output feature maps obtained from the ConvL are generally not greatly reduced in dimension. If the dimension does not change, a great amount of computation is needed, and it becomes very difficult to obtain a reasonable result from the network learning process [27]. The pooling layer is another important concept of CNNs: it simplifies the output by performing non-linear down-sampling, which reduces the number of parameters that the network needs to learn without changing the number of feature maps. In this paper, the pooling layer uses max pooling with a sampling size of 2 × 2.
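A minimal sketch of 2 × 2 max pooling in plain Python follows. It assumes non-overlapping windows (a stride of 2), which the text does not state explicitly, and a hypothetical toy feature map.

```python
def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max pooling; height and width must be even."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]), 2)]
            for r in range(0, len(fmap), 2)]


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [7, 1, 8, 2]]
pooled = max_pool_2x2(feature_map)
print(pooled)
```

Each 2 × 2 window is replaced by its largest value, so the map shrinks to a quarter of its size while the number of feature maps stays the same.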

Rectified Linear Unit (RELU)
The rectified linear unit is the most commonly used activation function in deep learning models. It is defined as the positive part of its argument, so it returns zero for any negative input: f(x) = max(0, x).

Fully Connected Layer
For the network, after several convolutions and max-pooling layers, the high-level reasoning in the neural network is done via fully connected layers. All neurons in a fully connected layer have full connections to all activations in the previous layer, and these fully-connected layers form a multi-layer perceptron (MLP), which plays the role of a classifier.

Output Layer
The classifier layer is the output layer of the CNN; the softmax regression classifier is used in this paper [28,29]. Softmax is a multi-class classifier with a strong non-linear classification ability and is used at the last layer of the network.
Given an input $x$ from the training data, the output category $y$ belongs to $\{1, 2, \ldots, k\}$, where $k$ is the total number of classes; in this article, $k$ is set to 10. For a given input $x$, the probability that its class is $y = i$ is
$$P(y = i \mid x; \theta) = \frac{e^{\theta_i^{T} x}}{\sum_{j=1}^{k} e^{\theta_j^{T} x}}$$
where $\theta_i$ denotes the parameters to be fitted, $e$ is the base of the natural logarithm, and $T$ denotes the transpose. $P(y = i \mid x; \theta)$ is the probability that the input $x$ belongs to class $i$, where $i$ takes values from 1 to $k$.
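The class probabilities of the softmax classifier can be computed as in the sketch below, written in plain Python. The weight vectors and the input are hypothetical, and the subtraction of the maximum score is a standard numerical safeguard that does not change the result.

```python
import math


def softmax_probabilities(x, theta):
    """P(y = i | x; theta) for every class i; theta[i] is the weight vector of class i."""
    scores = [sum(t * v for t, v in zip(theta_i, x)) for theta_i in theta]
    shift = max(scores)  # avoids overflow in exp without changing the ratios
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-dimensional input and three classes.
x = [1.0, 2.0]
theta = [[0.5, 0.5], [1.0, -1.0], [0.0, 1.0]]
probs = softmax_probabilities(x, theta)
predicted = probs.index(max(probs))
print(probs, predicted)
```

The probabilities sum to one, and the predicted expression is the class with the largest probability.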

Results and Discussion
In this paper, the tests were performed on a 64-bit personal computer (PC) with an i7 2.4 GHz processor and 8 GB of RAM, using MATLAB R2018b.

Performance Comparison of Enhancement Techniques
The evaluation and judgment of each type of image enhancement technique based on histogram equalization are given by the calculation of the parameters absolute mean brightness error (AMBE), and peak signal to noise ratio (PSNR) [30].

PSNR (Peak Signal to Noise Ratio)
The metrics are defined as follows. The mean square error (MSE) between two M × N grayscale images $I$ and $\hat{I}$ is
$$\mathrm{MSE} = \frac{1}{M N} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( I(i,j) - \hat{I}(i,j) \right)^2$$
and the PSNR is defined as
$$\mathrm{PSNR} = 10 \log_{10} \frac{(L-1)^2}{\mathrm{MSE}}$$
where $L - 1$ is the maximum gray level (255 for 8-bit images). A higher PSNR value indicates a better degree of contrast enhancement. Table 1 summarizes the results obtained.
Another parameter is used to rate the performance in preserving image brightness: the absolute mean brightness error (AMBE), defined by
$$\mathrm{AMBE} = \left| X_m - Y_m \right|$$
where $X_m$ and $Y_m$ are the mean intensities of the input and output images, respectively [31].
In contrast to PSNR, a lower value of AMBE indicates better brightness preservation; Table 2 shows the results obtained.
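Both metrics are straightforward to compute; the sketch below does so in plain Python for two small hypothetical images. It assumes 8-bit images (peak value 255) stored as lists of rows and is meant only to illustrate the definitions, not to reproduce the values in Tables 1 and 2.

```python
import math


def mse(a, b):
    """Mean square error between two images of identical size."""
    n = len(a) * len(a[0])
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n


def psnr(a, b, peak=255):
    """Peak signal to noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    return math.inf if err == 0 else 10 * math.log10(peak ** 2 / err)


def ambe(a, b):
    """Absolute difference between the mean intensities of the two images."""
    n = len(a) * len(a[0])
    return abs(sum(map(sum, a)) / n - sum(map(sum, b)) / n)


# Hypothetical input image and enhanced output.
original = [[100, 110], [120, 130]]
enhanced = [[90, 115], [125, 150]]
print(mse(original, enhanced), psnr(original, enhanced), ambe(original, enhanced))
```

PSNR reacts to every per-pixel change, whereas AMBE only compares the two mean intensities, which is why the two metrics are reported together.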

The Visual Comparison
The visual comparison of the facial images after enhancement is shown in this section (see Figures 5 and 6); the main goal is to judge whether the enhanced facial image has a more natural appearance and is visually acceptable to the human eye. On the basis of visual observation, it can be concluded that the CLAHE technique provides better visual quality and a more natural appearance than the other techniques.
After the visual observation, we focused on the impact of the clip-limit (CL) value and the block size (Bs) of the CLAHE algorithm.
First, we fixed Bs to [8 8] and varied CL from 0.001 to 0.010, calculating the PSNR value for each setting.
The PSNR results of CLAHE are shown in Figure 7. It can be observed from the figure that the CLAHE algorithm achieved the highest PSNR value at CL = 0.001 on both the JAFFE and CK+ databases.
Second, we fixed the clip limit at 0.01 and varied the block size from [2 2] to [128 128], calculating the PSNR value for each setting (see Table 3).

JAFFE Database
The JAFFE database consists of 213 grayscale images of 10 Japanese female models; the images are almost frontal poses covering 7 facial expressions, and each image has a size of 256 × 256 [32]. An illustration of the database is shown in Figure 8.
First, we processed the pictures from the JAFFE database as follows: the size of all the images was reduced to 64 × 64 pixels.
After that, contrast limited adaptive histogram equalization (CLAHE) was used for contrast enhancement. A block size of [8 8] was used for the JAFFE and CK+ databases.
Finally, we used 149 images for training (about 70% of the total) and 64 images for testing (about 30% of the total).
In Tables 4 and 5, N, A, D, F, H, Sa, and Su represent the seven basic expressions: neutral, anger, disgust, fear, happiness, sadness, and surprise, respectively.
The proposed method provided a high recognition accuracy of 99.2% for disgust and 98.9% for surprise, happiness, and sadness, while anger, neutral, and fear reached slightly lower accuracies, between 97.5% and 98.5%. The overall recognition accuracy on the JAFFE database was 98.63%.

CK+ Database
The CK+ database consists of 593 images in total from 123 subjects, each showing a facial emotion based on the subject's impression of one of the seven basic emotions [33].
The participants are between 18 and 50 years old; 69% of them are women, 81% are Euro-American, 13% are Afro-American, and 6% are from other groups. Image sequences for frontal views and 30-degree views were digitized into either 640 × 490 or 640 × 480 pixel arrays. An illustration of the database is shown in Figure 9.
First, the pictures from the CK+ database were processed as follows: the size of all the images was reduced to 64 × 64 pixels.
After that, contrast limited adaptive histogram equalization (CLAHE) was used for contrast enhancement.
Finally, we used 415 images for training (about 70% of the total) and 178 images for testing (about 30% of the total). The proposed method provided a high recognition accuracy of 100% for neutral, 99.7% for surprise, and 99.4% for happiness, while anger, disgust, sadness, and fear had lower accuracies, between 93.7% and 98.5%. The overall recognition accuracy on the CK+ database was 97.05%.
These results are satisfactory, but lower than those obtained on the JAFFE database; this is because the images in the CK+ database were captured in more difficult poses and under challenging lighting conditions.
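The 70/30 partitions used for both databases can be reproduced in size with the following plain-Python sketch. The random shuffling and the fixed seed are assumptions of this example; the text does not state how the images were assigned to the training and testing sets.

```python
import random


def split_indices(n_images, train_fraction=0.7, seed=0):
    """Shuffle the image indices and split them into training and testing sets."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # assumed: random assignment
    n_train = round(n_images * train_fraction)
    return indices[:n_train], indices[n_train:]


jaffe_train, jaffe_test = split_indices(213)  # JAFFE: 213 images
ck_train, ck_test = split_indices(593)        # CK+: 593 images
print(len(jaffe_train), len(jaffe_test), len(ck_train), len(ck_test))
```

Rounding 70% of 213 and of 593 gives the 149/64 and 415/178 partitions reported above.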

Results with and without Contrast Enhancement
In order to demonstrate the effect of CLAHE on the recognition rate, we compared two configurations: the first without CLAHE and the second with CLAHE.
The recognition rates obtained without the CLAHE enhancement algorithm for the JAFFE and CK+ databases are shown in Table 6. The results obtained with CLAHE are shown in Table 7. The comparison shows that CLAHE improves the results: the recognition rate on the JAFFE database is improved by 1.9% for neutral, 1.4% for anger, 1.9% for disgust, 1.33% for fear, 2.27% for happiness, 2% for sadness, and 1.73% for surprise.

Comparison with Other Methods
In order to demonstrate the effectiveness of our approach, its average recognition accuracy is compared with that of other FER approaches. Tables 8 and 9 show the comparison of the recognition accuracy obtained with our approach and with other approaches for the JAFFE and CK+ databases.
Table 8. Comparison between different approaches and our approach for the JAFFE face database. CNN: convolutional neural network.

Training Time
In this section, we compare the training times of the CNN algorithm and the proposed algorithm on both databases. The comparison results are shown in Table 10. It can be seen from Table 10 that the training time of the proposed algorithm is much shorter than that of the CNN algorithm, which means that our approach trains faster and more efficiently.
In short, the proposed algorithm greatly outperforms the traditional algorithm in terms of speed, recognition accuracy, and efficiency.

Conclusions
This work presents a method of facial expression recognition (FER) based on the Viola-Jones face detection algorithm and on facial image enhancement algorithms that improve image contrast. A comparative study of these enhancement techniques has been presented. From the PSNR and AMBE values obtained, we found that CLAHE outperforms all the other techniques: it improves the contrast and brightness of the image more than the other enhancement techniques.
The discrete wavelet transform (DWT) and a deep CNN are then combined. The facial features extracted by the DWT are the input for training the CNN, and the trained network is used for facial expression recognition.
This network consists of three ConvLs, two pooling layers, a fully connected layer, and one softmax regression layer that classifies the facial expressions.
The results achieved on the JAFFE and CK+ databases confirm the effectiveness and robustness of our method. In experiments on the testing sets of the JAFFE and CK+ databases, the expression recognition rate reaches 98.63% and 97.05%, respectively.