1. Introduction
Effective management of pain, including assessment and tracking over time, is necessary to study the effectiveness of different treatments [1,2] and to avoid the development of chronic pain syndromes [2]. Pain can be assessed either by patients’ self-reporting or by observation by members of the medical team [1,3]. Self-reporting on a verbal or visual scale is the gold standard of pain assessment [1,4], but this method can be inaccurate and challenging in people with severely impaired communication ability (e.g., people with later-stage dementia [5], burn-injured adults [1], neonates [6], intensive care patients [7], etc.). Two common observation-based methods for pain assessment are the Critical Care Pain Observation Tool (CPOT) and the Pain Assessment in Advanced Dementia Scale (PAINAD) [1]. Both tools are designed for patients unable to self-report [1]. The CPOT assesses facial expressions, body movements, ventilator compliance, and muscle tension/rigidity, while the PAINAD assesses breathing, vocalization, facial expressions, body language, and consolability [1]. Facial expressions are thus a core component of pain assessment in both the CPOT and PAINAD tools. Studies have shown that periodically monitoring patient pain levels in hospital intensive care units improved patient outcomes [8]. However, observation-based methods are highly subjective; human observers may be influenced by personal factors [6] and need to have the required experience [9]. Further, continuous pain monitoring may fail or not be sustained when there is a shortage of medical experts [9]. These factors show the need for an automatic system to monitor a patient’s pain level [6,8,9], which could also be used at home for elderly people [9]. A previous study that compared the performance of a human observer’s pain assessment with that of an automated pain assessment from facial expressions concluded that the latter outperforms the former and is more reliable [6]. However, detecting and assessing pain from images is challenging due to the diversity of head poses and environments, such as illumination conditions and heavy occlusion [8,9]. Furthermore, patients show high variance in their reactions and facial expressions to the same level of pain [8].
The actions that form the facial expression of feelings, such as lowering the eyebrows or squinting, are called action units (AUs), and each is assigned a unique code (e.g., AU6 is lifting the cheeks). For pain expression, six AUs (AU4, 6, 7, 9, 10, and 43) have been found to be the most representative [10]. By assigning numerical values to these six AUs, pain can be calculated via the Prkachin and Solomon pain intensity (PSPI) equation, as shown below:

PSPI = AU4 + max(AU6, AU7) + max(AU9, AU10) + AU43

This PSPI score determines pain intensity on a scale of 0 (no pain) to 16 (maximum pain).
Below is a detailed explanation of how each of the six AUs used in PSPI is scored (a short computational example follows the list):
AU4 is the intensity of lowering the eyebrows on a scale from 0 to 5 (0 = not lowered, 5 = maximally lowered).
AU6 is the intensity of raising the cheeks on a scale from 0 to 5 (0 = not raised, 5 = maximally raised).
AU7 is the intensity of tightening the eyelid on a scale from 0 to 5 (0 = not tight, 5 = very tight).
AU9 is the intensity of wrinkling the nose on a scale from 0 to 5 (0 = not wrinkled, 5 = very wrinkled).
AU10 is the intensity of raising the upper lip on a scale from 0 to 5 (0 = not raised, 5 = maximally raised).
AU43 is whether the eyes are closed, represented as a binary value (0 = open, 1 = closed).
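To make the scoring concrete, the following minimal Python sketch computes a PSPI score from the six AU values defined above; the function and variable names are illustrative and not part of the proposed system.

```python
def pspi_score(au4, au6, au7, au9, au10, au43):
    """Prkachin and Solomon Pain Intensity (PSPI).

    au4, au6, au7, au9, and au10 are intensities on a 0-5 scale;
    au43 is binary (0 = eyes open, 1 = eyes closed).
    Returns a score between 0 (no pain) and 16 (maximum pain).
    """
    return au4 + max(au6, au7) + max(au9, au10) + au43


# Example: strongly lowered brows, raised cheeks, and closed eyes.
print(pspi_score(au4=4, au6=3, au7=2, au9=1, au10=0, au43=1))  # -> 9
```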
In this paper, we developed a new facial-expression-based automatic pain assessment system (FEAPAS). The model uses a dual convolutional neural network (CNN) classifier to detect pain from facial expressions; the dual model better imitates the human brain’s visual perception [11]. In this new model, we used the upper partition of the detected face, namely the eyes/brow area, and the full face as the two input images of our dual classifier. Because our goal was to produce an online model with a response time fast enough to generate alerts in a timely manner, we avoided the extensive computation typically needed for deep learning systems by selecting the best of four pretrained networks (VGG16, InceptionV3, ResNet50, ResNeXt50), freezing all of its convolutional blocks, and using the resulting weights in our shallow dual CNN classifier (i.e., transfer learning).
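As a rough illustration of this dual-input design, the sketch below builds a two-branch classifier in tf.keras with a frozen, ImageNet-pretrained backbone. For brevity, a single InceptionV3 backbone is shared by both branches, whereas FEAPAS uses one subsystem per input; the input resolution, head layer sizes, and four-class output are assumptions made for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import InceptionV3

IMG_SHAPE = (224, 224, 3)   # assumed input resolution
NUM_CLASSES = 4             # no pain, low, moderate, severe

# Frozen pretrained backbone (all convolutional blocks kept fixed).
backbone = InceptionV3(weights="imagenet", include_top=False,
                       input_shape=IMG_SHAPE, pooling="avg")
backbone.trainable = False

full_face = layers.Input(shape=IMG_SHAPE, name="full_face")
upper_face = layers.Input(shape=IMG_SHAPE, name="upper_face")

# Extract features from both views and fuse them.
features = layers.Concatenate()([backbone(full_face), backbone(upper_face)])

# Shallow classification head replacing the original prediction layer.
x = layers.Dense(256, activation="relu")(features)
x = layers.Dropout(0.5)(x)
output = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = Model(inputs=[full_face, upper_face], outputs=output)
model.compile(optimizer="sgd", loss="categorical_crossentropy",
              metrics=["accuracy"])
```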
Using a camera, the proposed FEAPAS monitors a patient in bed; the backend code reads the video frame by frame and sends each frame to the classifier after detecting the patient’s face and extracting the upper face area. An alarm is activated if the classifier outputs a positive pain score, and the triggering frame is stored with the associated pain level, date, and time for inclusion in the report to the medical team.
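A simplified sketch of such a monitoring loop is given below, using OpenCV’s Haar cascade face detector; the detector choice, upper-face cropping rule, alarm mechanism, and log format are illustrative assumptions rather than the exact FEAPAS implementation (`model` refers to a dual-input classifier such as the one sketched above).

```python
import datetime
import cv2
import numpy as np

# Haar cascade shipped with OpenCV; any face detector could be substituted.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
PAIN_LABELS = ["no pain", "low pain", "moderate pain", "severe pain"]


def monitor(capture, model, log_path="pain_log.csv"):
    """Read frames from a camera, classify detected faces, and log alarms."""
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.3, 5):
            face = cv2.resize(frame[y:y + h, x:x + w], (224, 224))
            upper = cv2.resize(frame[y:y + h // 2, x:x + w], (224, 224))
            probs = model.predict([face[None] / 255.0, upper[None] / 255.0])
            label = PAIN_LABELS[int(np.argmax(probs))]
            if label != "no pain":
                # Hypothetical alarm: save the frame and log level, date, time.
                stamp = datetime.datetime.now()
                cv2.imwrite(f"alarm_{stamp:%Y%m%d_%H%M%S}.png", frame)
                with open(log_path, "a") as log:
                    log.write(f"{stamp.isoformat()},{label}\n")


# Usage sketch: monitor(cv2.VideoCapture(0), model)
```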
To achieve a fast and robust FEAPAS, two challenges must be overcome. First, the system’s performance depends heavily on the classifier’s performance; a classifier with high accuracy and a fast prediction process increases the system’s reliability. To obtain an efficient classifier and overcome the limited variety of face shapes in the training dataset, transfer learning is applied to four CNN models (VGG16, InceptionV3, ResNet50, ResNeXt50) by freezing their convolutional layers and replacing the prediction layer with a shallow CNN. Each is then tested with four different optimizers (SGD, ADAM, RMSprop, and RAdam). Two critical measurements were considered when selecting the optimal concurrent shallow CNN with frozen VGG16/InceptionV3/ResNet50/ResNeXt50 layers: the accuracy of 10-fold cross-validation and the accuracy on test data from an unseen subject. The resulting shallow model was embedded in our FEAPAS to generate timely and accurate alerts.
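This evaluation protocol can be summarized by the following schematic sketch, which assumes preprocessed arrays of full-face and upper-face images, labels, and per-frame subject identifiers; the helper `build_model()` stands in for the dual classifier described above, and all details are illustrative rather than the exact experimental code.

```python
import numpy as np
from sklearn.model_selection import KFold


def evaluate(X_full, X_upper, y, subjects, unseen_subject, build_model):
    """Report mean 10-fold accuracy and accuracy on one held-out subject.

    build_model() is assumed to return a compiled dual-input model whose
    second evaluate() metric is accuracy.
    """
    # Hold out every frame of one subject as the unseen-subject test set.
    test_mask = subjects == unseen_subject
    train_idx = np.where(~test_mask)[0]

    fold_acc = []
    for tr, va in KFold(n_splits=10, shuffle=True, random_state=0).split(train_idx):
        tr, va = train_idx[tr], train_idx[va]
        model = build_model()
        model.fit([X_full[tr], X_upper[tr]], y[tr], epochs=10, verbose=0)
        fold_acc.append(model.evaluate([X_full[va], X_upper[va]], y[va],
                                       verbose=0)[1])

    # Retrain on all non-test data and score the unseen subject.
    model = build_model()
    model.fit([X_full[train_idx], X_upper[train_idx]], y[train_idx],
              epochs=10, verbose=0)
    unseen_acc = model.evaluate([X_full[test_mask], X_upper[test_mask]],
                                y[test_mask], verbose=0)[1]
    return float(np.mean(fold_acc)), unseen_acc
```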
The second challenge is to speed up the system’s response time. To do so, we selectively sample the frames being tested. Instead of testing every single frame in the input sequence, we tested two frame-selection approaches: the first tests two segments, one at each end of a video sequence (boundary test), while the second tests one segment in the middle of the video sequence (middle test).
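A small sketch of how such frame indices might be chosen is shown below; the function name and the exact placement of the middle segment are assumptions made for illustration.

```python
def select_frames(n, delta, mode):
    """Return indices of the frames to classify from a sequence of length n.

    mode "boundary": delta frames at each end of the sequence (2*delta total);
    mode "middle":   2*delta consecutive frames centered in the sequence.
    """
    if mode == "boundary":
        return list(range(delta)) + list(range(n - delta, n))
    if mode == "middle":
        start = n // 2 - delta
        return list(range(start, start + 2 * delta))
    raise ValueError("mode must be 'boundary' or 'middle'")


# The four scenarios tested in Section 5 with N = 30:
print(select_frames(30, 1, "boundary"))  # 2-B: [0, 29]
print(select_frames(30, 1, "middle"))    # 2-M: [14, 15]
print(select_frames(30, 2, "boundary"))  # 4-B: [0, 1, 28, 29]
print(select_frames(30, 2, "middle"))    # 4-M: [13, 14, 15, 16]
```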
The rest of the paper is organized as follows: Section 2 reviews transfer learning models and optimizers; Section 3 reviews previous research on automatic pain assessment in chronological order; Section 4 describes the dataset and the proposed models; Section 5 presents the experimental results; and Section 6 offers concluding remarks.
2. Background
Transfer learning is used to speed up the training process and improve the performance of new, untrained models by reusing the weights of an existing model [12]. The concept is modeled after human intelligence, which uses current knowledge to solve new problems faster or better [13]. In transfer learning, the layers of a pretrained model are frozen and new task-specific layers are added on top to create a new model [12]. However, the benefit of transfer learning depends on the data and task of the new model [13]. We chose to test VGG16, InceptionV3, ResNet50, and ResNeXt50 as candidate classifier platforms for FEAPAS because they have been widely successful in image classification.
In the ILSVRC 2014 competition, the Visual Geometry Group’s VGG16 was ranked best in localization and second best in classification [12]. VGG16 increased the depth of the CNN architecture by adding more layers with small (3 × 3) filters and was trained on ImageNet with 1000 classes. The structure of VGG16 is shown in Figure 1 [14].
InceptionV1 won the ILSVRC 2014 classification task. It was proposed to overcome overfitting in deep convolutional neural networks by using multiple filters of different sizes at the same level [15]. InceptionV3 optimizes the performance of the previous Inception versions, which suffered from high computational cost, by applying factorized convolutions and aggressive regularization [16].
The residual neural network (ResNet) won ILSVRC 2015 in image classification, detection, and localization, and also won the MS COCO 2015 detection and segmentation challenges. ResNet was inspired by VGG, although it is deeper and less complex [17].
Neural networks may become less efficient as the number of layers increases and the model deepens [18,19,20]. ResNet adds direct (skip) connections between layers to solve the vanishing-gradient problem that arises from network depth; these connections preserve a portion of the output of earlier layers [18,19,20]. Deep neural networks have more layers, whereas wide neural networks have more kernels. However, a study comparing wider and deeper neural networks showed that shallower networks outperformed much deeper residual networks in classification and segmentation [20]; in fact, the number of trainable parameters is the main driver of ResNet’s performance [20]. The structure of ResNet50 is shown in Figure 2 [17].
ResNeXt was the runner-up in the ILSVRC 2016 classification task. It was designed to overcome the complexity of ResNet, which emerges from stacking modules of the same topology. ResNeXt takes its structure from the Inception model but uses a simpler branch design [21,22].
The optimizer algorithm plays a critical role in the neural network training process. Studies have investigated optimizer performance and found that particular optimizers work better for particular problems [23,24,25].
Stochastic gradient descent (SGD) is a popular algorithm for solving optimization problems, but it requires manually adjusting the learning rate decay [23,24]. To overcome this manual process, Kingma and Ba proposed adaptive moment estimation (ADAM), which makes the model converge faster and occupies little memory [24,25]. Root mean square propagation (RMSProp) is an optimization algorithm first proposed by Geoffrey E. Hinton to speed up model convergence through loss function optimization [24]. Whereas ADAM and RMSProp suffer from variance in the adaptive learning rate, rectified ADAM (RAdam) effectively addresses this issue [26,27].
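For context, the four optimizers compared in this work can be instantiated in tf.keras roughly as follows; the RAdam implementation is taken here from the TensorFlow Addons package, and the hyperparameter values are illustrative library defaults rather than the settings used in our experiments.

```python
import tensorflow as tf
import tensorflow_addons as tfa  # assumed source of the RAdam implementation

# Illustrative instantiation of the four optimizers compared in this work.
optimizers = {
    "SGD": tf.keras.optimizers.SGD(learning_rate=0.01),
    "ADAM": tf.keras.optimizers.Adam(learning_rate=0.001),
    "RMSprop": tf.keras.optimizers.RMSprop(learning_rate=0.001),
    "RAdam": tfa.optimizers.RectifiedAdam(learning_rate=0.001),
}

# Each candidate model is then compiled and trained once per optimizer, e.g.:
# model.compile(optimizer=optimizers["SGD"], loss="categorical_crossentropy",
#               metrics=["accuracy"])
```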
3. Related Work
Early studies in the field of pain assessment extracted features from frames using the active appearance model (AAM) and used a support vector machine (SVM) classifier [28,29] to create automated pain assessment systems. Khan et al. [30] later compared SVM in their proposed pain framework with three other classifiers: decision tree (DT), random forest (RF), and 2-nearest-neighbors (2NN) based on Euclidean distance. Their framework detects the face in a frame, horizontally halves the detected face, and uses the halves as the two inputs of the model; it encodes shape information with a pyramid histogram of oriented gradients (PHOG) and appearance information with a pyramid local binary pattern (PLBP) to obtain a unique representation of the face. More recent pain assessment work utilizes neural networks. Zhou et al. [31] used a recurrent convolutional neural network (RCNN) to introduce a real-time regression framework for automatic pain intensity estimation, while Rodriguez et al. [32] combined CNNs with long short-term memory (LSTM) in their model. Some researchers used dual models consisting of a fusion of CNNs, such as Semwal and Londhe, who used two shallow neural networks, a spatial appearance network (SANET) and a shape descriptor network (SDNET), in one model [9], and multiple neural networks in another [33].
Inspired by Khan et al. [30], our proposed model used two inputs based on the detected face parts to mimic the PSPI code; however, instead of using both halves of the face, we used the full face as one input and the upper face as the other. We employed neural networks for automatic feature extraction from frames, as in [31,32]; however, instead of an RCNN or a combination of CNNs and LSTM, we employed VGG16, InceptionV3, ResNet50, and ResNeXt50, each with a shallow CNN replacing the classifier layer.
The higher accuracy obtained by [9,33] encouraged us to adopt a fusion structure of CNNs.
We compared our model’s performance to the aforementioned models [9,30,32,33], which used the same dataset (the UNBC-McMaster shoulder pain expression archive) and the same measurement strategy (k-fold cross-validation accuracy) as ours. Vaish and Sagar’s state-of-the-art model [34] employs the KAZE algorithm to extract features from the detected face, computes a Fisher vector, and sends it to an SVM classifier; it also uses the same testing dataset and metrics, so we compared our results against it as well. The resulting optimal model is then embedded in our FEAPAS system.
5. Results
Table 3 shows the 10-fold cross validation results of the 16 models (precision, recall, and F1-score).
Table 4 shows the results of testing the classifier platforms, in terms of 10-fold cross validation accuracy and accuracy on data from an unseen subject, for the 16 combinations of four models and four optimizers trained on the UNBC-McMaster shoulder pain expression archive dataset. All 16 models showed high accuracy and achieved more than 96.00% on 10-fold cross validation, except for ResNeXt50 with SGD. On the unseen subject, InceptionV3 with SGD and ResNeXt50 with SGD achieved 90.56% and 90.19%, respectively, whereas ResNet50 with ADAM and InceptionV3 with RAdam achieved 88.21% and 86.10%, respectively. The remaining models achieved less than 85.00%. As shown in Figure 9, the SGD optimizer was more stable and fluctuated less during InceptionV3 training than the other optimizers (i.e., ADAM, RMSprop, and RAdam).
Analyzing the results of the trained models helped us decide which deep-learning framework worked best with our FEAPAS system. Based on the combination of the two accuracy values, InceptionV3 with SGD was selected as the model to be embedded in the FEAPAS system. We therefore conducted further experiments on FEAPAS to evaluate its overall performance.
Table 5 shows that our proposed approach outperformed previous approaches that were also evaluated on the UNBC-McMaster shoulder pain expression archive dataset using k-fold cross validation. Accuracy on an unseen subject was not provided by previous publications, so we could not compare it. Table 5 also shows whether each study used the entire dataset, a subset, or a variation of a subset plus data collected from other sources.
To measure the impact of brightness and head rotation on the InceptionV3 with SGD model, we applied the model to the three adjusted testing datasets (unseen subject) and recorded the accuracy for each. Table 6 shows the impact of illumination and head rotation on the model’s accuracy, compared to its 90.56% accuracy on the original testing data. As Table 6 shows, increasing the brightness dropped the accuracy by 0.81%, while decreasing the brightness and rotating the head had no negative effect on the efficiency of the model.
To test the response time of our system, we ran FEAPAS four times, i.e., twice for each frame-selection approach, with a sequence length of N = 30. The experiment was conducted by playing the test videos on a screen and pointing the laptop’s webcam at the screen to mimic a live feed. A stopwatch was used to record the time at which the alarm was activated.
We tested the following four scenarios with a sequence length of N = 30, using the middle and boundary frame-selection strategies with a segment length of 2Δ:
One frame at each end of the sequence: two-boundary test frames: 2-B, Δ = 1
Two frames at the middle of the sequence: two-middle test frames: 2-M, Δ = 1
Two frames at each end of the sequence: four-boundary test frames: 4-B, Δ = 2
Four frames at the middle of the sequence: four-middle test frames: 4-M, Δ = 2
Table 7 shows the impact of the frame-selection approach on response time. A larger segment length required more frames to be classified, which led to a longer response time. Whether classification finished at the middle or at the end of the sequence was an important factor and should be considered in future studies. Moreover, 2-M-30 showed the lowest average response time of 6.49 s, and 4-B-30 showed the highest average response time of 29.86 s.
Figure 10 shows the output of FEAPAS on video 1 using the 2-B-30 frame-selection strategy. FEAPAS stored the captured frames classified as pain, saving the date and time along with the pain level.
6. Conclusions
We developed a new facial-expression-based automatic pain assessment system (FEAPAS) to monitor patients and assist in the pain evaluation process. FEAPAS was designed to recognize four classes: no pain, low pain, moderate pain, and severe pain. When a patient is detected to be in pain, FEAPAS activates an alarm so that the medical team can take action. While developing FEAPAS, we focused on two main criteria: (1) the system should be precise enough not to miss any true alarm, and (2) it should be fast enough to catch pain situations and activate the alarm in time. Our proposed FEAPAS consists of two subsystems, each using a modified pretrained CNN; the modification consists of freezing the convolutional blocks and replacing the prediction layer with a shallow CNN. Each subsystem takes one of the system’s two inputs (i.e., the full face and the upper face). Among the 16 tested combinations (four pretrained CNN options, VGG16, InceptionV3, ResNet50, and ResNeXt50, and four possible optimizers, SGD, ADAM, RMSprop, and RAdam), the model with InceptionV3 and the SGD optimizer excelled, with an accuracy of 99.10% on 10-fold cross-validation and 90.56% on the unseen-subject data. Future work should explore other optimizers to further improve system performance. To speed up the response time of FEAPAS and avoid unnecessary alarms caused by momentary facial expressions, we classified a few selected frames instead of every single frame. Further, we tested two frame-selection approaches (i.e., at the two ends of the sequence and in the middle of the sequence) using a sequence length of 30 frames with segment lengths of two and four frames, respectively. FEAPAS correctly classified six online videos with 1611 frames (four videos recording severe pain and two recording no pain) with an average response time of less than 30 s.