A CNN-Based Advertisement Recommendation through Real-Time User Face Recognition

Abstract: The advertising market's use of smartphones and kiosks for non-face-to-face ordering is growing. An advertising video recommender system is needed that continuously shows advertising videos matching a user's taste and quickly switches to other advertising videos when an advertisement is unwanted. However, it is difficult to build a recommender system that identifies users' dynamic preferences in real time. In this study, we propose an advertising video recommendation procedure based on computer vision and deep learning, which uses the changes in users' facial expressions captured at every moment. Facial expressions represent a user's emotions toward advertisements, so we can utilize them to find a user's dynamic preferences. For this purpose, a CNN-based prediction model was developed to predict ratings, and a SIFT algorithm-based similarity model was developed to search for users with similar preferences in real time. To evaluate the proposed recommendation procedure, we experimented with food advertising videos. The experimental results show that the proposed procedure is superior to benchmark systems such as a random recommendation, an average rating approach, and a typical collaborative filtering approach in recommending advertising videos to both existing and new users. From these results, we conclude that facial expressions are a critical factor for advertising video recommendations and are helpful in properly addressing the new user problem in existing recommender systems.


Introduction
As the number of Internet users increases, users' interactions with multimedia devices are also growing rapidly. Recently, in fast food restaurants and cafes, non-face-to-face services that take orders through a smartphone app or a kiosk are increasingly replacing orders taken by staff. Moreover, because of COVID-19, non-face-to-face orders are growing at an explosive rate. Advertisers have not missed this opportunity and want to place various advertisements while orders are processed through kiosks or smartphone apps. In addition, online video platform providers such as YouTube and TikTok are working to increase advertising revenue by commercializing video products [1]. In particular, video content providers have introduced recommender systems to attain more website traffic by making it easy for users to find their favorite videos.
However, online advertising video or video recommendations face a unique problem. Users often become bored with advertising videos while watching them, even if they run for less than 30 s [2]. It is challenging to capture changes in user interests from profiles built on users' historical data over such a brief period. Recent advances in digital image processing technology have made it possible to track the facial expressions of online advertising video viewers in real time. Some researchers have argued that facial expressions are important in predicting a user's preference [3,4]. For example, a user frowns when watching brutal scenes. That is, users' facial expressions make it possible to predict their emotions. Therefore, we can utilize facial expressions for advertising video recommendations: they allow us to discover a user's dynamic preferences while they watch an advertising video.
In this study, we propose an advertising video recommendation procedure based on facial expression changes to cope with users' dynamic preferences within a short period. When a user is watching an advertisement, their emotion is captured by a webcam mounted on a kiosk, monitor, or smartphone. Their emotions are then compared with those of other users to decide whether to keep showing the advertising video the user is watching or to quickly replace it with another. To achieve the purpose of this study, we adopted a deep learning approach. Many studies have proposed a recommendation approach named collaborative filtering (CF), a traditional technique that recommends items suitable for users based on neighbors' preferences or purchasing history [5][6][7]. However, using only such a CF technique creates a "cold start" issue, whereby recommendations for new users suffer from unpredictability because of a lack of historical data on their past purchases. Additionally, CF has a "first start" issue: it cannot offer recommendations until a user's preferences have been recorded. There are also issues regarding the scalability of the model, arising from the continued growth of users' purchasing history or preference data [8][9][10]. In other words, it is challenging to offer recommendations in real time because computation takes a long time owing to model scalability problems. Many studies have been conducted to address issues such as data sparsity and scalability [11][12][13]. The recent deep learning approach has shown high performance in image processing and natural language processing and has received much attention [14,15]. Many studies applying deep learning have steadily been proposed in the study of recommender systems [16][17][18]. However, recommender systems using deep learning techniques have focused on encoding various types of input data, which was not possible in traditional CF.
Recommender systems using deep learning have been developed to improve recommendation performance by reflecting vast amounts of Internet review data or images in addition to existing sales data or service use records. However, a system like the one we propose, which directly receives video information and makes recommendations without any prior information about its user, has not yet been developed.
Therefore, this study applies a deep learning approach to offer personalized advertising videos by capturing new users' facial expressions in real time. Although the proposed procedure follows the principle of CF, we create a dynamic profile of users based on the changes in their facial expressions at every moment instead of using their historical records. To this end, we developed a recommender system that applies a convolutional neural network (CNN) to predict how much a user likes the advertising video they are currently watching by recognizing their facial expressions. Additionally, to recommend new advertising videos to users, we developed a scale-invariant feature transform (SIFT) algorithm-based similarity model to search for users with similar preferences in real time. To evaluate the proposed procedure, we compare its performance with three benchmark approaches, a random system, an average rating-based (best-selling) system, and a typical CF-based system, using eleven food advertising videos. The experimental results indicate that the performance of the proposed procedure is better than that of the benchmark systems in recommending advertising videos to both existing and new users. The main contributions of this study are as follows:

•
The proposed methodology effectively captures user facial expressions in real time through a deep learning approach to solve the information overload and data sparsity problems.

•
The proposed methodology can effectively recommend advertising videos without distinguishing between new users (with no history of watching or rating) and existing users (whose previous records or ratings are available). Therefore, the proposed methodology does not suffer from the new user problem.

•
The study conducts several experiments using real-world facial expression data to demonstrate that the proposed methodology outperforms existing recommendation approaches. We also found that facial expressions are an important factor for advertising video recommendations.
The rest of this study is organized as follows. Section 2 discusses related work on deep learning-based recommender systems, SIFT, and CNN. Section 3 discusses the proposed methodology and its components in detail. Section 4 describes the experimental design and discusses the results of this study. Section 5 summarizes the study, describes its limitations, and presents ideas for future work.

Deep Learning-Based Recommender Systems
Recommender systems are information-filtering systems that can solve the information overload problem by filtering important pieces of information from the information generated according to a user's interests, preferences, or observed behavior for a specific item [19,20]. The information overload problem is a phenomenon in which it becomes more difficult for individuals to make good decisions as the amount of data increases [21]. With the rapid spread of Internet use between the early and mid-1990s, recommender systems based on CF were developed to estimate which of the many items users would like; thus, the problem of information overload was solved [22,23]. The objective of a recommender system is to make meaningful recommendations from user data for items or products of interest [24]. While customers have more choice, Internet shopping malls and service providers face challenges in providing personalized product advertisements to customers. The recommender system collects user preference information for items such as movies, songs, books, travel destinations, and websites. CF is an algorithm that recommends items liked by other users with similar tastes; in effect, it implements word of mouth as a computer system. CF has been widely used as a recommendation methodology to this day, but it is also prone to performance degradation owing to data sparsity, as well as cold-start problems, long tail problems, and scalability problems [25,26].
Deep learning has advanced the structure of recommender systems and provides several ways to improve their performance. The development of deep learning-based recommender systems has received great attention because these systems can overcome the limitations of existing CF models (e.g., the data sparsity, cold-start, scalability, and long tail problems) and achieve high recommendation quality [7,8,[27][28][29][30]. Li, et al. [31] extracted latent features of user preferences or ratings using restricted Boltzmann machines, an undirected two-layer graphical model that is a kind of probabilistic graphical model. Hu, et al. [32] extracted high-level features from low-level features of user preferences through a deep belief network, a deep neural network composed of multiple layers of latent variables. The autoencoder, another deep learning model, is used to reduce the dimensionality of the user-item matrix and extract more latent features from the encoder output [33,34]. Ko, et al. [35] analyzed user behavior changes over time using a recurrent neural network (RNN), a deep learning model specialized in processing sequence data, and combined the RNN analysis results with latent factors regarding user preferences to offer more accurate recommendations. In addition, the CNN, which shows excellent performance in tasks such as image recognition and object classification, extracts latent factors and latent features from raw data such as audio, text, and images, thereby providing good performance for recommender systems [36,37]. Zhang, et al. [38] proposed a recommender system to address the difficulty viewers have in finding anchors of interest; the authors developed a multi-head component to capture the preferences between anchors and viewers and extract relevant features for their representation. The results show that the proposed model outperforms state-of-the-art recommendation models.
In summary, deep learning-based recommender systems have been developed to predict a user's preference or recommend suitable items by extracting the hidden relationships between users and items from various types of input data, such as images and unstructured data, that were previously impossible to analyze. However, it is difficult to find a deep learning-based recommender system that can make recommendations in real time by recognizing only a user's facial expression, without using pre-registered data or information such as purchase records or images. Among existing related studies, as shown in Table 1, most researchers collect user facial expressions and extract features to build dynamic user profiles. However, most studies applied simple heuristic techniques instead of computer vision approaches to extract user facial features, and most applied the CF method in the recommendation phase. This traditional approach has many limitations because it is critical to capture a user's facial expression and recommend advertisements quickly and accurately in real time. Users can quickly become bored, even with short videos, and it is difficult to capture changes in user interest in a profile using the user's historical data over a short period. Therefore, it has become necessary to capture a user's facial expressions accurately and efficiently in real time. In this study, we develop a recommender system through a comprehensive deep learning approach to recognize user expressions and predict how much they enjoy the advertisement they are currently watching. This study elaborately extracts user facial expression characteristics in real time by applying SIFT, widely used in computer vision, to reduce the existing gap in such studies. We also apply a CNN to accurately predict user preferences based on the extracted facial expressions.
Table 1 summarizes representative related studies. One study proposed a facial recognition integrated recommender system to address profiling problems (user information recognition) in retail stores; based on face recognition, the user's information (gender, age, and facial expression) is predicted. De Pessemier, et al. [41] (2016), in "Enhancing recommender systems for TV by face recognition", proposed a system that supplemented a TV content recommender system by detecting and recognizing the facial emotions of users watching TV; through the proposed system, age, gender, and emotion were predicted. Another study used facial recognition technology to predict user emotions and developed a music recommender system; to this end, faces were recognized using AdaBoost, and facial emotions were classified into eight categories through an SVM.

Scale-Invariant Feature Transform
Feature detection and image matching are among the most critical tasks in the field of machine vision. Because computational efficiency and accuracy differ depending on which feature detector and descriptor extraction algorithm is used, it is critical to choose a suitable algorithm for the feature matching task at hand [43]. Algorithms such as SIFT, speeded-up robust features (SURF), and binary robust invariant scalable keypoints (BRISK) are mainly used, and each algorithm differs in performance [44]. SIFT is known to be the most robust in feature detection and matching; in particular, its robustness is apparent with respect to image scale, rotation, and affine transformation. SURF is based on SIFT and changes the filter size instead of using an image pyramid. SURF is robust to image scale and rotation but weak for affine transformation; another advantage is its relatively fast calculation speed compared with SIFT [45]. BRISK is a binary corner detection algorithm robust to changes in image scale and rotation, and it enables faster calculation than SIFT and SURF [46]. According to previous studies, SIFT is somewhat disadvantageous in terms of speed but has excellent accuracy [43,47]. Therefore, we perform feature extraction by applying the SIFT algorithm to accurately and effectively extract features from user facial expressions.
The SIFT algorithm is an image descriptor developed for image-based matching and recognition [48,49]. An image descriptor is a means of expressing images such as keyframes and faces, and it is used for image comparison when extracting scene transitions and searching for similar images. The SIFT algorithm extracts characteristic points unique to an image, and it is robust to many environmental changes such as image size change, deformation, rotation, and lighting change. The SIFT algorithm has proven useful for experimentally measuring the similarity between images, matching images, and recognizing objects [50]. The SIFT algorithm consists of four main processes: scale-space extrema detection, keypoint localization, orientation assignment, and keypoint descriptor generation. Scale-space extrema detection creates the scale-space and detects extrema. The scale-space refers to the set of images obtained by building an image pyramid that resizes the original image at various scales and then applying Gaussian blur with a growing scale factor to each octave image of the pyramid. After obtaining the scale-space, difference of Gaussian (DoG) images are obtained: within each octave, two images blurred with different Gaussian scales are subtracted, and potential interest points are identified. Finally, extrema detection is performed on the DoG images. If a pixel is a local minimum or local maximum, it is classified as a keypoint candidate: in the DoG images, the target pixel's value is compared with its 26 surrounding pixels (8 neighbors at the same scale and 9 at each adjacent scale) to determine whether it is an extremum. In the keypoint localization phase, candidate keypoints that are not located on the correct coordinates, or that are unstable in scale and location, are identified.
Therefore, unstable candidate keypoints are removed according to stability measurements, and only stable keypoints are selected. The orientation assignment phase determines the gradient direction for each keypoint based on a local image patch: a 16 × 16 patch around the keypoint is taken, the image in it is Gaussian-blurred, and the orientation and magnitude of the gradient are determined for each point. In the final phase, a keypoint descriptor is created to express the characteristics of each keypoint. The extracted keypoints and descriptors are then used to match keypoints between images; this is called keypoint matching. Keypoint matching calculates the Euclidean distance between the descriptors of each keypoint in two images and matches the closest keypoints. Image matching tasks such as object detection, recognition, image retrieval, and tracking are among the most challenging tasks in the field of computer vision, and keypoint matching has been reported to give good results in object recognition and object detection [49]. Therefore, this study developed a SIFT algorithm-based similarity model to search for users with similar preferences in real time by comparing facial expression changes between two users using keypoint matching.
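The keypoint matching described above (nearest Euclidean distance between descriptors) can be sketched in a few lines. This is a minimal NumPy illustration on toy 4-dimensional descriptors (real SIFT descriptors are 128-dimensional); the ratio-test filter and the toy values are illustrative assumptions, not the study's actual implementation.

```python
import numpy as np

def match_keypoints(desc_a, desc_b, ratio=0.75):
    """Match SIFT-style descriptors by Euclidean distance.

    For each descriptor in desc_a, find its two nearest neighbors in
    desc_b and keep the match only if it passes Lowe's ratio test.
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    matches = []
    for i, d in enumerate(desc_a):
        # Euclidean distance from descriptor d to every descriptor in desc_b
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        nearest, second = order[0], order[1]
        if dists[nearest] < ratio * dists[second]:
            matches.append((i, int(nearest)))
    return matches

# Toy 4-dimensional "descriptors" from two face images.
a = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
b = np.array([[0.9, 0.1, 0.0, 0.0],   # close to a[0]
              [0.0, 0.0, 1.0, 0.0],   # far from both
              [0.1, 0.9, 0.0, 0.0]])  # close to a[1]

kps = len(match_keypoints(a, b))  # number of matched keypoints
```

In the proposed procedure, the number of such matched keypoints between two face images is later used as the keypoint score (KPS).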

Convolution Neural Networks
CNN is known to be the most active field of research in deep neural networks [51,52] and was first introduced in the study "Backpropagation applied to handwritten zip code recognition" published by LeCun in 1989 [53]. LeCun later proposed the first CNN, a network called LeNet, in 1998 [54]. CNNs are mainly used to solve difficult pattern recognition tasks, mainly focusing on images, and consist of an accurate yet simple architecture [52]. Recently, the CNN has been recognized as a powerful tool that can be used in various areas such as face, image, video, and voice analysis [51]. A CNN is similar to a traditional artificial neural network (ANN), being composed of neurons that self-optimize through learning [52]. However, a CNN is more proficient at reducing the number of ANN parameters. This allows researchers and developers alike to solve tasks that could not be solved with a classic ANN and to access larger models. A CNN assumes that features do not have spatial dependence in problem solving [51]. For example, in face recognition, a face is recognized even if it appears at an arbitrary position in a given image. Another important aspect of CNNs is that abstract features are extracted as the input propagates to deeper layers; as layers are stacked, local features are advanced to global features, thus solving the problem of not reflecting the overall relationships in the image, which is a disadvantage of fully connected neural networks. As a result, a CNN is robust to transformations of the input data. A CNN creates a feature map from an input image through a convolution filter. To extract several different features, the number of convolution kernels can be set accordingly. Sub-sampling reduces the size of the feature map and, through it, topology invariance can also be obtained. After several stages of convolution and sub-sampling, the size of the feature map decreases, leaving only the robust features that can represent the whole.
The global features obtained in this way are connected to the input of a fully connected network (FCN). As with an ANN, it is possible to produce an optimal recognition result through learning. In this study, the user's face image is extracted and represented as a 3D matrix. The difference between the current image matrix and the previous image matrix is obtained, and this is defined as the face change image. We developed a CNN-based rating prediction model trained on the face change images over time and predicted a rating with the FCN.
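The convolution and sub-sampling steps described above can be illustrated with a minimal NumPy sketch. This is a toy single-channel example of a feature map followed by max pooling, not the network used in this study; the edge-detecting kernel and image values are illustrative assumptions.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as used in CNNs)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling (sub-sampling)."""
    h, w = fmap.shape
    return fmap[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
edge_kernel = np.array([[1.0, -1.0],
                        [1.0, -1.0]])              # responds to vertical edges

fmap = conv2d(image, edge_kernel)   # 5x5 feature map
pooled = max_pool(fmap)             # 2x2 map after sub-sampling
```

Stacking several such convolution/pooling stages shrinks the feature map until only robust, global features remain, which are then fed to the fully connected layers.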

A Face Recognition-Based Recommender System
This study aims to develop a user-customized advertising video recommender system that searches for similar users in real time and recommends advertisements suitable for users' preferences. A CNN-based prediction model was developed to predict, in real time, the rating a user would give an advertising video while watching it. It uses a CNN trained on the facial expression changes of people who watched the advertising video in the past together with their ratings. Additionally, a SIFT algorithm-based similarity model was developed to search for users with similar preferences in real time: using the keypoint matching of the SIFT algorithm, the user's facial expression changes over time are compared with those of other users to find similar neighbors. In summary, the proposed recommender system continuously predicts the rating of the advertising video using the CNN while the target user is watching the advertisement. When the predicted rating falls below a certain threshold, similar neighbors are found through the SIFT model, and advertising videos that those neighbors rated in the past are recommended to the target user.

Overall Process
The overall process of our suggested methodology is shown in Figure 1. When a user watches an advertisement in front of the webcam, their appearance is recorded through the webcam. In addition, the user who watches the advertisement assigns a preference score for each advertisement on a five-point scale. Then, the user's face image is extracted from the video at 0.5 s intervals. The rating prediction process is composed of independent CNN models for each 0.5 s interval.

Data Collection and Data Pre-Processing
A user who views an advertisement rates the advertisement in question. Therefore, user i's rating for advertisement j, r_{i,j}, can be expressed as in Table 2; Table 3 is an example of actual experimental data. Table 2. User rating on each item. Table 3. Actual user rating data.
The user's face image is extracted at each time step from the real-time video of the user watching the advertisement. Faces in the video are detected with a CNN-based face detection model using max-margin object detection (MMOD). Images are captured at 0.5 s intervals from the video, and the user's face in each image is recognized and extracted, as shown in Figure 2. Our methodology then predicts users' ratings and searches for similar users from the extracted images.
The face data extracted at each time step through the face detection model can be represented as shown in Table 4.

The data collection and preprocessing steps are shown in Figure 3. When user i watches an advertisement for item j, facial data and rating data are accumulated and matched for later use. In the similar-user search algorithm, the face image itself is used as input data because similarity is measured by finding singularities (keypoints) in the face. In contrast, the rating prediction algorithm predicts the rating according to the degree of change in the image, so the amount of change in the image is used as input data. The amount of change is calculated as the difference in matrix (pixel) values between two adjacent images, as in Equation (1):

F_{i,j,t} = I_{i,j,t} − I_{i,j,t−1}, (1)

where I_{i,j,t} denotes the face image matrix of user i for advertisement j at time t. An example is shown in Figure 4.
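The frame differencing of Equation (1) can be sketched as follows. The 4 × 4 arrays stand in for face images captured 0.5 s apart (toy values, not real data).

```python
import numpy as np

def face_change(prev_frame, curr_frame):
    """Equation (1): amount of change between two adjacent face images,
    computed as the element-wise difference of their pixel matrices."""
    # Promote to a signed type so negative differences are preserved.
    return curr_frame.astype(np.int16) - prev_frame.astype(np.int16)

# Two toy 4x4 grayscale "face images" captured 0.5 s apart.
i_prev = np.full((4, 4), 100, dtype=np.uint8)
i_curr = i_prev.copy()
i_curr[1:3, 1:3] = 130          # a small facial region changed

f = face_change(i_prev, i_curr)  # F_{i,j,t} in the paper's notation
```

Pixels where the expression did not change yield zero, so the change image isolates exactly the regions of facial movement that the rating prediction model consumes.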

CNN-Based Rating Prediction Model
In deep learning, the deeper the neural network (NN), the better the performance tends to be; however, deep networks are difficult to train. In particular, the gradient vanishing and gradient exploding problems cause the training error to increase as the network deepens. ResNet (residual neural network) is a CNN-based deep artificial neural network that uses a residual learning framework to make training easy even in deep neural networks [55].
In this study, the images in the video are analyzed with a deep convolutional neural network (DCNN); the larger and more complex the image, the deeper the required neural network. Therefore, ResNet, which shows excellent performance even in deep neural networks, is used.
The face data F_{i,j,1}, F_{i,j,2}, · · · , F_{i,j,T} for user i in Table 4, which represent the amount of face change over time, correspond to the rating r_{i,j} in Table 2. ResNet and neural network models that predict ratings using the face changes over time and the rating data as inputs are trained as depicted in Figure 5. For the time t to be predicted, the average of the prediction results from time 1 to time t through ResNet and the neural network is defined as the predicted rating for time t, as expressed in Equation (2):

r̂_{i,j,t} = (1/t) Σ_{k=1}^{t} f(F_{i,j,k}), (2)

where f(·) denotes the prediction of the trained ResNet and neural network for a single face change image.
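The running average of Equation (2) can be sketched as follows. The trained ResNet is replaced here by a hypothetical stub (`stub_model`) whose mapping from pixel change to a 1-5 rating is purely illustrative.

```python
import numpy as np

def predict_rating(face_changes, model):
    """Equation (2): the predicted rating at time t is the mean of the
    per-frame model predictions from time 1 to time t."""
    preds = [model(f) for f in face_changes]
    return float(np.mean(preds))

# Hypothetical stand-in for the trained ResNet: maps the mean absolute
# pixel change of a frame to a rating on a 1-5 scale (illustrative only).
def stub_model(face_change):
    return min(5.0, 1.0 + np.abs(face_change).mean() / 10.0)

# Three toy face change images F_{i,j,1..3} with growing change.
frames = [np.full((4, 4), v, dtype=float) for v in (0.0, 10.0, 20.0)]

rating_t3 = predict_rating(frames, stub_model)  # prediction at t = 3
```

Averaging over all frames seen so far smooths out single-frame noise, so the predicted rating stabilizes as the user keeps watching.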

Keypoint Score-Based Recommendation Model
Through the SIFT algorithm, the keypoints of F_{i,j,t−1} and F_{i,j,t} are extracted, and keypoint matching is performed based on the extracted keypoints. Here, the number of matched keypoints is defined as the keypoint score (KPS). Figure 6 illustrates the KPS calculation process from F_{i,j,t−1} and F_{i,j,t}. The keypoint scores at each time for item j are shown in Table 5.
This study searches for similar users by analyzing the similarity of KPS data and recommends items to users in real time based on those similar users. The similarity between users from time 1 to time T for item j is calculated using the cosine similarity of the KPS data. Cosine similarity and Pearson similarity are usually used to obtain the similarity between the vectors of two users [5,11,56]; this study uses cosine similarity. The cosine similarity between user m and user n is calculated as in Equation (3):

sim(m, n) = (Σ_{t=1}^{T} KPS_{m,t} · KPS_{n,t}) / (√(Σ_{t=1}^{T} KPS_{m,t}²) · √(Σ_{t=1}^{T} KPS_{n,t}²)). (3)

Figure 7 illustrates the process of searching for similar users, in which user 3 is determined to be similar to user 2. The top N advertisements preferred by the user with the highest similarity are recommended to the target user. For example, if user 2 is the target user, then the advertisements preferred by user 3 (the advertisements with the maximum r_{3,j}) are recommended to them.
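The KPS-based similarity search and top-N recommendation can be sketched as follows; the KPS vectors and rating values are illustrative toy data, not the experimental data of this study.

```python
import numpy as np

def cosine_sim(a, b):
    """Equation (3): cosine similarity between two users' KPS vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(target_kps, neighbor_kps, neighbor_ratings, top_n=2):
    """Find the most similar neighbor by KPS and return the indices of
    the top-N advertisements that neighbor rated highest."""
    sims = [cosine_sim(target_kps, kps) for kps in neighbor_kps]
    best = int(np.argmax(sims))
    ranked = np.argsort(neighbor_ratings[best])[::-1]  # descending rating
    return best, [int(i) for i in ranked[:top_n]]

# KPS over T = 4 time steps for the target (user 2) and two neighbors.
user2 = np.array([12.0, 9.0, 15.0, 7.0])
neighbors = [np.array([3.0, 14.0, 2.0, 11.0]),   # user 1
             np.array([11.0, 10.0, 14.0, 8.0])]  # user 3: similar profile
ratings = np.array([[2.0, 5.0, 1.0, 3.0],        # user 1's ad ratings
                    [1.0, 2.0, 5.0, 4.0]])       # user 3's ad ratings

best, ads = recommend(user2, neighbors, ratings)
# user 3 (index 1) is the closest neighbor; their top-rated ads are returned
```

Because the similarity is computed on facial-expression-derived KPS vectors rather than rating histories, this search works for new users with no prior records.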

Benchmark Model
As benchmark models to compare against the CNN-based model suggested in this study, (1) a CF-based, (2) an average rating-based (best-selling), and (3) a random recommender system are used. As the objective of this study is to recommend advertising videos by comparing data between users, a typical CF is used, in which similarity is measured based on the ratings given by two users. A random recommender system and an average rating-based system (which uses the average of all other users' ratings, excluding the target user's rating, for the recommendation) are used as benchmark systems. The random recommender system is used to show that the proposed recommendation approach is more profitable: if there is no difference between the accuracy of the random recommendation and that of the proposed recommendation, then the latter is not acceptable, whatever the measure of accuracy. The average rating-based system is used to show that the proposed approach outperforms a simple best-selling approach; the average rating approach is essentially a best-selling recommendation algorithm adapted to the online video recommendation problem. A notable point is that the proposed approach can recommend advertising videos without distinguishing between new and existing users, because it does not use a user's past purchase history or rating records but recommends advertising videos using only facial changes. In contrast, as CF can only recommend advertising videos to users with existing data, it is used only to compare recommendation performance for existing users.

Datasets
To evaluate the performance of the proposed methodology, we collected user facial expression data from 1 May 2020 to 31 December 2020. We prepared 11 advertising videos to collect the facial expressions of users watching them. The videos used in the experiment were food advertisements, and the average length of a video was 20.1 s (minimum 11 s, maximum 30 s). The users watched each advertisement and gave ratings on a five-point Likert scale (1 = strongly dislike, 2 = dislike, 3 = neutral, 4 = like, and 5 = strongly like). While the users watched the advertisements, their facial expressions were collected in real time, and we performed face recognition and face image extraction through a deep learning approach. A total of 77 users participated in the experiment. Their demographic information is shown in Table 6.

Experiment Design
For the evaluation of the CNN-based advertisement recommender system, two experiments were performed. In the first experiment, the accuracy of the CNN-based rating prediction model was measured. The extracted user face image was represented as a 3D matrix, and the difference between the current image matrix and the previous image matrix was obtained; this is defined as a face change image. A CNN-based rating prediction model was trained on the face change images over time, as explained in Section 3.3. As the dataset of this study is small, we used LOOCV (leave-one-out cross validation), which performs well when training data are scarce. LOOCV evaluates the performance of a model by using one sample as test data and the other n − 1 samples as training data from n data samples [57]. Therefore, there are n validations for one epoch, and the average validation accuracy is taken as the accuracy of the model.
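The LOOCV procedure with an MAE score can be sketched as follows; a simple mean predictor stands in for the CNN model here, which is our own simplifying assumption for illustration.

```python
import numpy as np

def loocv_mae(ratings):
    """Leave-one-out cross validation: each sample is held out once as the test set,
    and the remaining n - 1 samples form the training set."""
    ratings = np.asarray(ratings, dtype=float)
    errors = []
    for i in range(len(ratings)):
        train = np.delete(ratings, i)   # n - 1 training samples
        pred = train.mean()             # stand-in predictor; the paper trains a CNN here
        errors.append(abs(ratings[i] - pred))
    # MAE: mean absolute error over all held-out samples
    return float(np.mean(errors))

print(loocv_mae([5, 4, 3, 4, 5]))
```

With n samples, this performs n train/validate rounds, and the average error over the rounds is reported as the model's accuracy, mirroring the procedure described above.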
To measure the accuracy of the CNN-based rating prediction model, this study used the mean absolute error (MAE). The MAE has been used for measuring and comparing the average performance error of a model and is relatively robust to outliers [6,7,57,58]. The MAE is calculated as in Equation (4):

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| r_i - \hat{r}_i \right|   (4)

where r_i is the actual rating and \hat{r}_i is the predicted rating of the i-th sample.

In the second experiment, the accuracy of the advertisement recommendation model for new users was measured. Each user's keypoint score (KPS) over time was calculated through the proposed SIFT algorithm-based KPS calculation method using the extracted user face images. Users similar to a target user were found by computing the cosine similarity of each user's KPS over a specific time period, as shown in Equation (3). The advertisement recommender system recommends the Top-K advertisements for the target user, selected based on similar users' preferred advertisements. To measure the accuracy of the advertisement recommender system, this study used the recommendation hit ratio (RHR), defined as shown in Equation (5):

\mathrm{RHR@}K = \frac{\left| \text{recommended Top-}K \cap \text{user's Top-}K \right|}{K}   (5)
where a user's Top-K is defined as the K advertisements preferred by the target user, arranged in descending order of preference.
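Under one consistent reading of the RHR definition (the overlap between the recommended Top-K and the user's preferred Top-K, divided by K), the metric can be sketched as follows; the advertisement ids are hypothetical.

```python
def rhr_at_k(recommended, preferred, k):
    # Recommendation hit ratio: fraction of the Top-K recommended advertisements
    # that also appear in the user's preferred Top-K list.
    rec_k = set(recommended[:k])
    pref_k = set(preferred[:k])
    return len(rec_k & pref_k) / k

# Hypothetical ranked lists of advertisement ids (best first)
recommended = ["ad3", "ad7", "ad1", "ad9"]
preferred = ["ad7", "ad2", "ad3", "ad5"]
print(rhr_at_k(recommended, preferred, 3))
```

At K = 3, two of the three recommended advertisements ("ad3" and "ad7") appear in the user's preferred Top-3, giving a hit ratio of 2/3.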

Experiment Result 1: CNN-Based Rating Prediction Model
To evaluate the performance of the CNN-based rating prediction model for new users, an average rating prediction system and a random rating prediction system were constructed as benchmark systems. As CF is a recommendation method based on historical data, it cannot be used as a benchmark system for evaluating new users in real time; it is used as a benchmark system only when evaluating existing users.
The MAE between the actual and predicted ratings at each time point was used as the evaluation index, and the average MAE over the entire time was taken as the performance of the model. The average MAE values for the CNN-based rating prediction system, the random rating system, and the average rating-based system were 0.756, 1.584, and 0.935, respectively. The CNN-based system was better than the other systems at predicting preference ratings of advertising videos for new users. More details are presented in Figure 8.
Appl. Sci. 2021, 11, x FOR PEER REVIEW
To compare the performances for existing users, a CF-based rating prediction system was added and analyzed. The analysis results are shown in Figure 9. The average MAE values for the CNN-based rating prediction system, the random recommender system, the average rating-based system, and the CF-based system were 0.756, 1.584, 0.935, and 0.946, respectively. The CNN-based system was better than the other systems at predicting ratings of advertising videos for existing users. However, in the case of video3, video6, and video9, the performance of the CF-based system was better than that of the CNN-based system.
Furthermore, the CF-based system shows a more stable recommendation performance, regardless of the video type, than the suggested CNN-based system. Whereas the CF-based system used data evaluated and stored by existing users, the CNN-based system predicted the degree of video preference from facial expressions alone, without existing user data; thus, it was considered relatively unstable. For the performance comparison, experiments were performed assuming that existing data were available. However, as our proposed CNN-based system estimates a user's rating value by detecting their facial changes in real time, the CF-based system cannot be included in the real-time analysis for new users.

Experiment Result 2: Advertisement Recommendation Model
As with the rating prediction model, the performance evaluation of the real-time KPS-based advertisement recommender system is divided into two situations: (1) new users and (2) existing users. For new users, best-selling and random recommender systems were used as benchmark systems, and a CF-based recommender system was added for existing users. As the performance evaluation index, the RHR@K of Equation (5), as defined above, was used. Several experiments were performed with the number of recommended videos (Top-K) varying from 1 to 11. The performance from Top-1 to Top-11 was compared; the average RHR@K over all users is shown in Figure 10.
As shown in Figure 10, the performance of the KPS-based recommender system is clearly higher than that of the best-selling or random recommender systems. Because the target scenario is stopping the advertising video currently being viewed on a smartphone or kiosk and showing another advertising video, it is reasonable to focus on the performance when exactly one advertising video is recommended. In this case, the KPS-based recommender system performs far better than the benchmark systems. Moreover, the KPS-based system remained robust over the varying recommendation list sizes. Accordingly, we find that facial expressions are critical factors in recommendations to new users, and we claim that the KPS-based system addresses the new user problem. To compare the performance for existing users, a performance analysis was conducted by adding a CF-based recommender system. The analysis results are shown in Figure 11.

Figure 11. The performance evaluation of recommender systems for existing users.
As a result of our performance analysis, the performance of the CF-based system was better than that of the KPS-based system when the recommendation list size was three or more. However, it is reasonable to view only the performance evaluation of Top-1, as the main problem in this study is the case of stopping the currently viewed advertising video on a smartphone or kiosk and showing another advertising video. In this case, it can be seen that our proposed system performs much better. The averages of the RHR@1 values for the KPS-based system, the random recommender system, the CF-based system, and the best-selling system were 0.348, 0.078, 0.195, and 0.221, respectively. The KPS-based system outperformed other systems in recommending videos for existing users.

Discussion and Conclusions
As computer vision and information technology advance, an environment has been established for collecting various types of data, such as facial expressions and purchase history. However, most online video recommender systems have built dynamic user profiles by capturing facial expressions through heuristic techniques. Such an approach struggles to recommend advertisement videos in real time to new users without past viewing history. In this study, we propose a novel recommender system using computer vision and a deep learning approach to overcome the limitations of existing video recommender systems. Specifically, we developed a CNN-based prediction system that predicts how much a user enjoys an advertising video by recognizing their facial expressions, and we applied the SIFT algorithm to search in real time for users with similar preferences so that advertisement videos can be recommended to new users.
The experiment results are as follows. First, the proposed CNN-based rating prediction system outperforms the other systems in predicting preference ratings of advertising videos for both new users and users with viewing records. The proposed CNN-based recommender system predicts user preferences based on real-time user facial expressions. Although the recommendation performance for new users varies depending on the video type, the overall predictive performance is excellent. However, the proposed method cannot offer as stable a recommendation performance when the overall historical data are considered; in that case, the CF technique can be used more efficiently. Accordingly, we find that facial expressions are critical factors in recommendations to new users. Second, the proposed KPS-based advertisement recommender system shows excellent performance when recommending large recommendation lists to both new and existing users, whereas the CF-based system shows a more stable performance when the recommendation list is small. Therefore, data such as click history may be more suitable than real-time facial expression data when providing a small recommendation list. Nevertheless, we found it more effective to recommend advertisements based on real-time user facial expressions when several advertisements were recommended.
The academic and practical implications of this study are as follows. First, we proposed a methodology for recommending online advertising videos through a deep learning approach. In previous studies, user facial expressions were collected through heuristic techniques, whereas we extracted them elaborately by combining a deep learning approach with the SIFT algorithm. Similar users are identified based on these facial expression characteristics, and customized advertisements are recommended in real time. This study thereby contributes to the expansion of research areas related to recommender systems. Second, existing recommender system research mainly uses memory-based CF techniques. However, memory-based CF is a lazy learning technique that produces a result through a heuristic computation whenever a recommendation is required, without building a model. As the types of online videos have become diverse and the number of users has increased, data sizes have grown, and the existing algorithms consume a lot of time and resources. For online advertising videos in particular, it is necessary to quickly find recommendation items using real-time information and to make personalized recommendations for each user. This study expands research on recommender systems by building a CNN-based model, trained on actual data, that recommends online advertising videos to users in real time. Third, a recommender system manager must accurately establish the user type and select a recommendation method suitable for the company's strategy. Through experiments, we showed that the method proposed in this study is highly effective for new users, whereas the CF-based method was more stable for users with historical data. Therefore, a manager should choose a recommendation method according to the user type and provide a personalized advertisement service.
Fourth, when the proposed model provides more recommendation lists, managers need to prepare various advertisement videos because the recommendation performance is excellent. The proposed model can precisely grasp a user's preference while capturing their facial expressions in real time. According to the experimental results of this study, a manager needs to set the recommendation list as extensively as possible to grasp users' preferences.
However, this study has several limitations, which suggest future research topics. First, the experiments were performed with a relatively small dataset, so the training was limited. Owing to the nature of image data, a large dataset is needed and training should be repeated many times, but we could not obtain a large amount of data during collection. Continuously collecting data to reveal the relationship between dataset size and performance is therefore a promising future research topic. Second, image processing is time-consuming, so the training time was extensive. Improving training time efficiency by optimizing the learning structure is another suitable research topic; for example, SIFT, used in this study as a feature detection and image matching algorithm, can be compared with SURF, BRISK, and so on. Additionally, optimizing data augmentation for efficient learning with little data will be a necessary research topic. Finally, in this study, facial expressions were used to evaluate users' satisfaction with advertising videos; however, the same technology could be used in many other domains or for other purposes. For example, it could predict whether a customer searching for a particular product in a store is likely to purchase that product. Applying the idea of predicting customer purchase intentions in physical stores and on metaverse platforms will also be interesting for future research.