3D Facial Expression Recognition for Defining Users' Inner Requirements—An Emotional Design Case Study

Introduction
Donald A. Norman, in his book "Emotional Design: Why We Love (or Hate) Everyday Things" and in other publications, has stated that a design is a success only when the final product succeeds in making customers buy it, use it, enjoy it, and spread word of it to others [1,2]. He claimed that designers need to ensure that the design satisfies people's needs in terms of function, usability, and the ability to deliver emotional satisfaction, pride, and delight. In order to design a product that elicits positive emotions, it is necessary to define, and be able to predict, the target users' emotional responses early in the design process.
Emotional Design emerged with the objective of promoting positive emotions [1] or pleasure in users [3,4] through the design properties of products and services [5]. According to Van Gorp and Adams [6], since emotions influence decision making, affect memory and attention, and generate meaning, they can deeply affect the overall user experience. An essential part of the design process is the capability of understanding the user's feelings and emotions [7].
The assignment of specific numerical values was undertaken by matching the answers given by a potential customer in a tailored interview with the results of a facial expression recognition (FER) procedure applied to the customer's face, acquired via depth camera during the interview. All the interviews were both audio-recorded and recorded with the depth camera.
The depth camera senses distance using coded-light (structured-light) technology: a known pattern is projected onto the surface of an object, a face, or any other element, and depth information is obtained from the way in which the pattern is deformed. This technology is designed specifically for close-range applications, and this kind of sensor works best within the range 0.2–1.5 m. The projected pattern is infrared (IR), and is therefore outside the visible spectrum. The depth video format, as well as the frame rate, is adjustable. Typically, the recommended resolution is the best available (640 × 480), while the frame rate is 30 FPS (frames per second). The output of every frame is a depth map.
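The geometric principle behind coded-light sensing can be illustrated with the standard triangulation relation: a nearer surface shifts the projected pattern more on the sensor than a farther one. The following is a minimal sketch; the focal length and projector–camera baseline below are invented for illustration and are not the SR300's actual calibration.

```python
# Toy triangulation: depth is recovered from the lateral shift (disparity)
# of a projected pattern element as seen by the camera.
# focal_px and baseline_m are illustrative values, not real SR300 parameters.

def depth_from_disparity(disparity_px, focal_px=475.0, baseline_m=0.025):
    """Triangulate depth (metres) from the observed pattern shift in pixels."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# A nearer surface deforms (shifts) the pattern more than a farther one:
near = depth_from_disparity(40.0)  # large shift -> small depth
far = depth_from_disparity(10.0)   # small shift -> large depth
assert near < far
print(round(near, 4), round(far, 4))
```

With these illustrative parameters both results fall inside the 0.2–1.5 m working range quoted above.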
The reason for monitoring and evaluating facial expressions was to measure the emotional involvement in every topic, i.e., so that weights are assigned according to a quantification of the user's engagement.
In a traditional QFD, the scale used to attribute weights to the needs is usually a 1-to-5 Likert scale. Differing from the standard approach, the degree of importance of the needs was not asked of users directly through a questionnaire but was obtained as the output of the facial expression recognition algorithm. A model was sought in the literature for matching weights (numerical values or degrees) to specific emotions, especially from the perspective of user involvement. Russell proposed an emotional model, called the "circumplex model", which is shown in Figure 2 and can be discretized as it is placed on a Cartesian plane [36,37]. The x-axis quantifies the positivity/negativity of the emotion; the y-axis quantifies the emotional activation and involvement. Considering the canonical emotional tone of a professional interview regarding a product to be conceptualized, only the "positive quadrants", the first and the fourth, were considered for our model.
Weights 1, 2, 3, 4, and 5 have been assigned respectively to the emotions "deactivation", "contentment", "pleasure", "excitement", and "arousal", according to the circumplex model. We also considered a simplified model, with weights 1, 3, and 5 assigned respectively to "deactivation", "pleasure", and "arousal", which could be adopted instead of the complete one depending on the degree of emotional involvement of the scenario, the topic, and the interviewed person. The two models are shown in Figure 3.
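The two emotion-to-weight mappings can be written down as simple lookup tables. This is a minimal sketch of the models described above; the dictionary structure is our own illustration.

```python
# Complete model: weights 1-5 over the five emotions of Russell's
# positive quadrants, as assigned in the methodology above.
COMPLETE_MODEL = {
    "deactivation": 1,
    "contentment": 2,
    "pleasure": 3,
    "excitement": 4,
    "arousal": 5,
}

# Simplified model: three levels, for less emotionally involving scenarios.
SIMPLIFIED_MODEL = {"deactivation": 1, "pleasure": 3, "arousal": 5}

def need_weight(emotion, model=COMPLETE_MODEL):
    """Map a recognized emotion label to a QFD importance weight."""
    return model[emotion]

assert need_weight("excitement") == 4
assert need_weight("arousal", SIMPLIFIED_MODEL) == 5
```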
During the interview, which is organized with the potential user to define his/her needs, a depth camera should be placed in front of him/her to acquire the face frame by frame. The camera is started when the interview starts, together with a vocal recording. The vocal recording is then analysed, and notes should be taken on the exact range of seconds in which the interviewer's question is about to finish and the interviewee starts answering.
This is supposed to be the very moment in which the micro-expression of the inner emotion (the "ground truth" feeling) is displayed by the face. Thus, these are the moments in which the degree of the user's actual emotional involvement can be measured to define the numerical values of the needs' weights in the QFD.
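The key-frame selection described above can be sketched as follows: given a second range noted from the audio track (end of question / start of answer), pick the depth frames recorded in that range at the camera's 30 FPS. The window values below are invented for illustration.

```python
# Select depth-camera frame indices falling inside an annotated time window.
FPS = 30  # the recommended frame rate stated above

def frames_in_window(t_start, t_end, fps=FPS):
    """Indices of the frames captured between t_start and t_end (seconds)."""
    first = round(t_start * fps)
    last = round(t_end * fps)
    return list(range(first, last + 1))

# e.g. the interviewee starts answering between 12.5 s and 13.0 s:
key_frames = frames_in_window(12.5, 13.0)
assert key_frames[0] == 375 and key_frames[-1] == 390
```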
Facial data undergo a selection of the significant frames, which are then post-processed so that a depth map remains, framing the face alone. Then, an algorithm is run on the facial depth map to automatically localize 17 landmarks [38], shown in Figure 4, with a thresholding methodology [39] based on differential geometry descriptors [40].
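The thresholding idea can be illustrated with a toy descriptor: a discrete Laplacian of the depth map serves here as a crude proxy for the differential-geometry descriptors of [40], and thresholding it isolates strongly convex candidate points such as the nose tip. The real method localizes 17 landmarks with richer descriptors; the 5×5 "depth map" below is invented for illustration.

```python
def laplacian(z, i, j):
    """Discrete Laplacian of depth map z at interior pixel (i, j)."""
    return z[i-1][j] + z[i+1][j] + z[i][j-1] + z[i][j+1] - 4 * z[i][j]

# Toy depth map (metres from camera); the centre bump mimics a nose tip.
Z = [
    [0.60, 0.60, 0.60, 0.60, 0.60],
    [0.60, 0.58, 0.56, 0.58, 0.60],
    [0.60, 0.56, 0.50, 0.56, 0.60],
    [0.60, 0.58, 0.56, 0.58, 0.60],
    [0.60, 0.60, 0.60, 0.60, 0.60],
]

THRESHOLD = 0.1  # keep only strongly convex points
candidates = [
    (i, j)
    for i in range(1, 4)
    for j in range(1, 4)
    if laplacian(Z, i, j) > THRESHOLD
]
assert candidates == [(2, 2)]  # only the nose tip survives the threshold
```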
Euclidean distances are then calculated between the landmarks, adopting the pronasale as an anchor point, as shown in Figure 5: the distance between every landmark and the pronasale is computed. These distances are adopted as the features input to a classification methodology based on a support vector machine (SVM). Other classifiers could also be adopted, depending on the cardinality of the dataset. The classification is obtained by providing in advance the number of clusters into which to funnel the data. These clusters represent the emotions; thus, they are three or five in number, depending on whether the simplified or the complete model is chosen.
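The feature-extraction step can be sketched as follows. The paper feeds the distance features to an SVM; the nearest-centroid assignment below is only a dependency-free stand-in to show the shape of the pipeline, and all coordinates and centroid values are invented.

```python
import math

def distance_features(landmarks, pronasale):
    """Euclidean distances of each 3D landmark to the pronasale anchor."""
    return [math.dist(p, pronasale) for p in landmarks]

# Invented 3D coordinates (metres) for three of the 17 landmarks:
pronasale = (0.0, 0.0, 0.50)
landmarks = [(0.02, 0.03, 0.52), (-0.02, 0.03, 0.52), (0.0, -0.04, 0.55)]
features = distance_features(landmarks, pronasale)

def classify(feats, centroids):
    """Assign the feature vector to the nearest emotion-cluster centroid
    (a stand-in for the SVM used in the actual methodology)."""
    return min(centroids, key=lambda c: math.dist(feats, centroids[c]))

# Illustrative centroids for three of the five clusters (keys = weights):
centroids = {1: [0.02, 0.02, 0.03], 3: [0.04, 0.04, 0.06], 5: [0.07, 0.07, 0.09]}
print(classify(features, centroids))
```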
Once the list of needs is defined according to the interview topics and answers, every facial frame adopted in the facial expression recognition methodology is labelled with a specific need, so that the result of the emotion analysis provides an emotional involvement outcome for that specific customer need. If a facial frame labelled with need X falls in cluster 4, representing excitement, then the value 4 is added to the set of possible weight values of need X. When the final weight of need X is to be assigned to proceed with the QFD, the median is computed among the available values obtained with the clustering. The whole methodology concerning needs is summed up in Figure 6.
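The aggregation step above reduces to a median per need. A minimal sketch, with invented need labels and cluster weights:

```python
from statistics import median

# Cluster weights obtained for the frames labelled with each need
# (need names and values invented for illustration).
frame_weights = {
    "remote learning": [4, 4, 3, 5, 4],
    "hands-on practice": [2, 3, 3],
}

# The final QFD weight of each need is the median over its frames.
need_weights = {need: median(w) for need, w in frame_weights.items()}
assert need_weights == {"remote learning": 4, "hands-on practice": 3}
```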


Case Study
The method has been applied to a case study in the agricultural field. The final objective is to design an e-learning path for farmers so that they can embed recent technologies in their daily farming practices. QFD appeared suitable for understanding the current needs in the field and for correlating those needs with services.
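The way a QFD matrix correlates weighted needs with services can be sketched as a weighted sum: each service's priority is the sum of the need weights multiplied by the need-service relationship strengths (the canonical 1-3-9 QFD scale is assumed here). The needs, services, and relationship values below are invented for illustration, not taken from the project's actual matrix.

```python
# Invented need weights (as produced by the FER-based method) and services:
NEEDS = {"remote learning": 4, "hands-on practice": 3}
SERVICES = ["e-learning modules", "virtual FabLab"]

# relationship[need][service], on the canonical QFD scale {0, 1, 3, 9}:
REL = {
    "remote learning": {"e-learning modules": 9, "virtual FabLab": 3},
    "hands-on practice": {"e-learning modules": 3, "virtual FabLab": 9},
}

# Service priority = sum over needs of weight x relationship strength.
priority = {s: sum(w * REL[n][s] for n, w in NEEDS.items()) for s in SERVICES}
print(priority)
```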
Thirteen people from the field of interest were interviewed using an interview form developed in the European project "Farmer 4.0". The interview, reported in Appendix A, is approximately 40 min long and is composed of 74 questions. All the interviews were audio-recorded; one interviewed person was also recorded with an Intel RealSense SR300 depth camera. Thus, the 13 audio recordings supported the conceptualization of the needs list, while the 3D video recording was adopted for assigning weights. The person recorded with the depth camera was the most representative of the whole group, as he was used to interviews for agriculture projects, to public speaking, and in general to social events and media. He was therefore evaluated as the most suitable candidate for the emotional involvement analysis. Figure 7 shows the unprocessed depth map given by the camera and a picture taken during the interview.
A total of 138 frames, and the corresponding facial depth maps, were selected and post-processed in MATLAB, relying on the identification of the key moments of the interview between every question and every answer. They were "need labelled" so that the results of the FER method could support the assignment of weights accordingly. Figure 8 shows some processed depth maps in which the face has been framed.
The landmark localization algorithm was then run on the faces (Figure 9) and the Euclidean distances were computed. According to the type of data and the emotions displayed by the interviewee's face, the complete model with weights 1, 2, 3, 4, and 5 was chosen in this case; thus, five clusters were adopted. The clustering results allowed us to "cluster label" every facial frame, so that the median could be computed among the obtained weights for every need. The resulting QFD matrix is shown in Figure 10.

Discussion
The obtained final weights underwent validation, relying on the competencies of the project consortium. The results were examined and evaluated as suitable for every need taken into consideration in this QFD. In the literature, the development of QFD models that derive technical design elements from psychological and behavioural needs has been studied in several fields of application [42–45].
Generally speaking, the information obtained through QFD permits the determination of product improvement directions and sets targets and specifications for the product. The QFD approach decreases time to market and costs, as the probability of mid-course corrections and implementation errors is reduced [43,46–49].
In particular, the advantage of the present methodology compared with traditional QFD approaches is that the weight assignment does not require any work from the interviewee, as the method determines the weight values relying only on the detected facial expressions. The drawback is that the proposed technique requires a depth camera, which, despite its low cost, may not be available to a research group. It also requires some processing of the data and knowledge of clustering techniques. Nonetheless, several open software packages and routines for this purpose are already available to the community.
The methodology is still preliminary and not yet ready for all application contexts. The next steps of this research will strengthen the FER methodology by adding new geometrical features so that the clustering can be more robust. A psychologist will be asked to evaluate the emotions of the customer during the interview, so that a validation of the SVM classification can be provided.


Conclusions
A new QFD methodology is proposed in this work, in which the weights of needs are assigned relying on detected facial expressions. Russell's circumplex model of emotions is used as the theoretical background to discretize the emotion interval and make the emotions correspond to numerical values representing degrees of emotional involvement. During the interviews, the face of the user is acquired frame by frame with a depth camera, which allows 3D models of the face to be obtained. These data are processed and classified with a support vector machine, and weights are assigned according to the result of this classification.
The approach has been applied to a case study of a European project in agriculture in order to understand the needs of agricultural entrepreneurs. The method, validated by the project consortium, appears to weight the inner needs accurately, and the resulting QFD worked appropriately for the project's purposes.
This technique is new in the field of product design, as it merges knowledge from user-centred design with pattern recognition techniques coming from a computer vision background. Even if the method is still preliminary, this approach could open an avenue for designing new products that respond to the inner needs of the customer.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

12. (from 1 to 10):
• The evaluation path of the user in terms of skills and abilities
• The training path
• The simulative environment
• The job shadowing
• The co-working
31. Could it be useful to represent in the simulation environment a virtual FabLab in which the user can move at 360° to view typical FabLab equipment and tools and learn how they work?