An Empathy Evaluation System Using Spectrogram Image Features of Audio

Watching videos online has become part of a relaxed lifestyle. The music in videos subtly influences human emotions, perception, and imagination, making people feel relaxed, sad, and so on. It is therefore particularly important for people who make advertising videos to understand the relationship between the physical elements of music and empathy characteristics. The purpose of this paper is to analyze the music features in advertising videos and extract the music features that make people empathize. This paper combines the power spectrum of the MFCC with RGB image analysis to obtain the audio feature vector. In the spectral analysis, the eigenvectors obtained range from blue (low range) to green (medium range) to red (high range). A random forest classifier is used to classify the extracted features, and the trained model is used to monitor the development of an advertisement empathy system in real time. The optimal model achieves a training accuracy of 99.173% and a test accuracy of 86.171%, which is confirmed by comparing the three models of audio feature value analysis. The contributions of this study can be summarized as follows: (1) low-frequency, high-amplitude audio in a video is more likely to produce resonance than high-frequency, high-amplitude audio; (2) by observing the characteristics of the machine learning classifiers, it is found that frequency and amplitude are important attributes for describing audio waveforms; (3) a new audio extraction method is proposed to induce human empathy; that is, the feature values extracted by the spectrogram-image-feature method of audio have the greatest ability to arouse human empathy.


Introduction
Empathy, the phenomenon characterizing our understanding and sharing of others' feelings, is vital to our everyday communication and survival in a social environment [1]. There are many methods of empathy evaluation, but they can be classified into three main categories: subjective evaluation, image processing, and bio-signals [2]. In image processing, a four-camera vision system was established to sample a target from different perspectives. In this system, a global calibration technique was deployed to correlate each individual camera system; the local point clouds were then extracted, filtered, and stitched [3]. The subjective evaluation method usually uses a questionnaire but has limitations due to subjective ambiguity and individual differences. To compensate for these limitations, researchers also use objective methods to evaluate empathy.
Immersive environments can induce emotional changes capable of generating states of empathy. Considering an immersive environment as a socio-technical system, human and non-human elements can interact and establish strong relationships [4]. In video content, audio sensitively affects human emotions, perceptions, and imagination [5], which can relieve tension or create a feeling of sadness. Therefore, it is important for video creators, such as advertising designers, to understand the relationship between the physical elements of audio and empathy characteristics. By analyzing the obtained features, we determine that the distinguishing feature values are low, medium, and high power in the high-frequency domain; low, medium, and high power in the middle-frequency domain; and low and high power in the low-frequency domain.
Six machine learning classifiers were used to learn and classify the collected feature values in order to obtain the highest accuracy: the boosting-tree classifier, decision tree classifier, MLP (multilayer perceptron), random forest classifier, KNN, and SVM. As a result, the training and test accuracies of the random forest classifier are 99.173% and 86.171%, respectively. The highest rate is obtained from the RGB spectrogram filtered by the MFCC, a hybrid model of the three feature extraction methods. The results show that the empathy ability gradually increases as the power increases, whereas the ability to discriminate empathy characteristics gradually weakens as the frequency domain increases.
The contributions of this study can be summarized as follows: (1) compared with high-frequency, high-amplitude audio, low-frequency, high-amplitude audio is more likely to make people empathize with the video; (2) by observing the characteristics of the machine learning classifiers, it is found that the frequency and amplitude of the audio are important attributes for describing the waveform; (3) a new audio extraction method is proposed to induce human empathy; that is, the feature values extracted by the spectrogram-image-feature method of audio have the greatest ability to arouse human empathy. All the code is saved at this link: https://github.com/zjing13cc/audio-processing (accessed on 13 October 2021).

Experiment Procedure
We conducted two experiments for this research: in experiment one, 30 subjects watched 24 advertisements; in experiment two, 77 subjects watched 8 movie clips.

Subjects
There are 30 subjects (15 males and 15 females) who are aged 20-30 years in experiment one, and 77 subjects (38 males and 39 females) aged 20-40 years old in experiment two. All the subjects were asked to have adequate sleep and no caffeine before the experiment and were paid $20 for participation in the whole experimental process. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of Sangmyung University, Seoul, Korea (BE2018-35).

Stimuli
To create an emotional stimulus and induce empathy, this study chose new stimuli specially designed for this research. The degree of empathy may vary according to the emotion. Previous research on empathy has shown that media content that triggers emotions causes empathy more than media content that appeals to rationality, and that storytelling causes empathy more than enumerating facts or statistical evidence [22][23][24]. Therefore, this study collected YouTube video clips through which content providers hoped to convey emotion and induce empathy. On the assumption that the audience's empathy and the effect of the advertising would be high, the researchers selected 50 advertisements with high reviews, high ratings, and high actual sales, and six subjects were surveyed in a preliminary experiment. Following this, 24 online videos were screened for the first experiment. Taking into account the number of times the same product's advertisement was played at different times, the 24 advertisements were selected from Baxter, Vita 500, Samsung Printer (MLT-D 115), and other products, as shown in Figure 1. Videos marked with a red box are empathetic, and the rest are non-empathetic. The plots of TV dramas or movies trigger empathy more easily. To dig deeper into empathetic elements, we screened the 8 TV-show and movie clips with the highest scores in subjective appraisals from a pool of 20. They came from clips of Korean TV dramas or movies, as shown in Figure 2; videos marked with a red box are empathetic videos, and the rest are non-empathetic.

Feature Analysis
This paper used three methods to analyze the emotion of audio: (1) extracting feature values with the MFCC human-ear-model filter; (2) the spectrogram image feature method without pre-processing; and (3) a method combining the two. By comparing the feature values extracted by the three methods on empathetic and non-empathetic audio, shown in Figure 14, a consistent and optimal solution was obtained, and the highest training accuracy was achieved using machine learning methods.

Mel-Frequency Cepstral Coefficients (MFCCs) Method
The MFCC (mel-frequency cepstral coefficient) is an algorithm that characterizes voice data. The mel can be thought of as a value derived from the human cochlea, through which humans perceive audio. Although the cochlea is curled, each part of the cochlea senses a different frequency when it is stretched out. Humans perceive audio based on the frequencies detected by the cochlea, hence these frequencies are used as features. However, the cochlea has a special property: frequency changes are detected better in the lower frequency band than in the higher frequency band. The part of the cochlea that detects the low-frequency band is thick, whereas the part that detects the high-frequency band is thinner. Rather than simply using the raw frequency as a feature vector, it is more effective to select features that match the characteristics of the cochlea. A standard set of features for music information is described in the retrieval literature [25,26], including low-level features (RMS level, Bark bands), spectral features (energy in low/middle/high frequencies, spectral centroid, spectral spread, spectral skewness, spectral kurtosis, spectral roll-off, spectral crest, spectral flux, spectral complexity), timbral features (mel-frequency cepstral coefficients, tristimulus), and melodic features (pitch, pitch salience and confidence computed with the YIN algorithm [27], inharmonicity, dissonance) [28][29][30][31], as shown in Table 1.

Table 1 groups these features by type; for example, the temporal features include the zero-crossing profile, key clarity, harmonic change, and musical mode.
The quantitative evaluation method of empathy analyzes the original data in the frequency domain using the extracted audio features. The power spectrum is computed, and frequency components are extracted from a plurality of frequency bands in the spectrum. A single color is then applied to the signals from each band to convert them into RGB image data; the RGB image data are stored as learning data, and a model file, including weights, is generated by training on the learning data. For comparative voice data extracted separately from an input video, a convolutional neural network is applied using the trained weights of the model file to decide whether the video is empathetic or uninteresting. At this stage we apply a triangular filter, extracting frequency components from multiple bands with different frequencies on the power spectrum, and a discrete cosine transform (DCT) can be applied to the frequency components of the multiple bands [32]. The flow of the MFCC feature extraction method is shown in Figure 3.
The pre-emphasis filter is useful in many ways, and it can be applied to the signal x using the first-order filter in the following equation:

y(t) = x(t) − αx(t − 1) (1)
Previous work has found that typical values of the filter coefficient (α) are 0.95 or 0.97 [33]. The coefficient is selected as 0.97 in this audio processing. Figures 6 and 7 show the time-domain and frequency-domain signals after pre-emphasis filtering.
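As a concrete illustration, the pre-emphasis step can be sketched in a few lines of NumPy; this is a minimal sketch of the standard first-order filter with α = 0.97, not necessarily the exact code in the linked repository:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis filter: y(t) = x(t) - alpha * x(t - 1)."""
    # Keep the first sample unchanged, then difference the rest.
    return np.append(x[0], x[1:] - alpha * x[:-1])

emphasized = pre_emphasis(np.array([1.0, 2.0, 3.0, 4.0]))
```

Because the filter subtracts a scaled copy of the previous sample, slowly varying (low-frequency) content is attenuated and the high frequencies are boosted.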
In the second stage, an NFFT-point fast Fourier transform is performed on each frame to calculate the frequency spectrum, also called the short-time Fourier transform (STFT), in which NFFT is usually 256 or 512 [34]. We set NFFT to 512 and then calculate the power spectrum (periodogram) using the following equation. When the number of operation points is an integer power of 2, let x(i) be a finite-length sequence of N points, that is, x(i) is defined for 0 ≤ n ≤ N − 1. Here x_i is the i-th frame of the signal x, as shown in Figure 8; after executing this process, the result can be obtained.

P = |FFT(x_i)|² / NFFT (2)

The last step in calculating the filter bank is to apply a triangular filter to the power spectrum. The mel scale simulates the non-linear perception of the human ear, as audio is more recognizable to human ears at lower frequencies than at higher frequencies.
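The periodogram computation described in the second stage can be sketched as follows; this is a minimal NumPy illustration in which the frame data are random placeholders:

```python
import numpy as np

NFFT = 512  # FFT length used in the paper

def power_spectrum(frames, nfft=NFFT):
    """Periodogram of each frame: P = |FFT(x_i)|^2 / NFFT."""
    mag = np.abs(np.fft.rfft(frames, nfft))  # one-sided magnitude spectrum
    return (mag ** 2) / nfft

frames = np.random.randn(10, 400)  # placeholder: 10 frames of 400 samples
P = power_spectrum(frames)         # shape (10, NFFT // 2 + 1)
```

Using the one-sided `rfft` halves the storage for real-valued audio; each row of `P` is the periodogram of one frame.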
We can switch between hertz (f) and mel (m) using the following equation:

m = 2595 log10(1 + f / 700) (3)

The mel filter and mel inverse filter are shown in Figure 9.
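The hertz/mel conversion follows directly from the standard formula m = 2595 log10(1 + f / 700); a minimal sketch of the two directions:

```python
import numpy as np

def hz_to_mel(f):
    """Hertz -> mel: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Mel -> hertz, the exact inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

m = hz_to_mel(1000.0)
f = mel_to_hz(m)  # round-trips back to 1000 Hz
```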
The discrete cosine transform modifies the filter bank coefficients and generates a compressed representation of the filter banks. Typically, the resulting cepstral coefficients 2-13 are retained and the rest are discarded for automatic speech recognition (ASR); here, 12 values are chosen for num_ceps. The mel spectrum after this discrete cosine transform is shown in Figure 10. In this paper, audio files are extracted from the advertisements, and the frequency spectrum of each audio file is obtained using the MFCC. When calculating the power spectral density on the decibel power scale, the width of the Hamming window is 4.15 s, and the size of the sliding window is 50 ms. The median size of the audio signal spectrogram is 371 × 501 pixels.
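The DCT step above can be sketched as follows; this is a generic orthonormal DCT-II with the keep-coefficients-2-13 selection, where the random log filter-bank energies are placeholders for real data:

```python
import numpy as np

def dct2(x):
    """Orthonormal DCT-II along the last axis
    (equivalent to scipy.fft.dct(x, type=2, norm='ortho'))."""
    N = x.shape[-1]
    n = np.arange(N)
    k = n.reshape(-1, 1)
    basis = np.cos(np.pi * (2 * n + 1) * k / (2 * N))  # basis[k, n]
    out = 2.0 * (x @ basis.T)
    out[..., 0] *= np.sqrt(1.0 / (4 * N))  # orthonormal scaling
    out[..., 1:] *= np.sqrt(1.0 / (2 * N))
    return out

log_fbank = np.log(np.random.rand(10, 40) + 1e-8)  # placeholder log energies, 40 filters
mfcc = dct2(log_fbank)[:, 1:13]  # keep cepstral coefficients 2-13 (num_ceps = 12)
```

Discarding the first coefficient removes the overall log-energy offset, and dropping coefficients above 13 discards the fast spectral variations that matter little for recognition.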

Spectrogram RGB Image Feature Method
Analysis is carried out by resampling the audio in short blocks of 1 s duration and then calculating the RGB images of the spectrograms. These images are used to fine-tune a pre-trained image classifier.
Representing the spectrogram on four different frequency scales (linear, mel, equivalent rectangular bandwidth (ERB), and logarithmic) allows the effects of the high-, medium-, and low-frequency characteristics of the audio to be observed separately. Using the red (R), green (G), or blue (B) components of the RGB images shows the importance of the high-, medium-, and low-amplitude audio components, respectively. The algorithm is shown in Figure 11.
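The band-to-channel idea can be sketched in NumPy as follows; the equal-thirds split of the frequency axis is an assumption for illustration, not the paper's exact banding:

```python
import numpy as np

def spectrogram_to_rgb(spec):
    """Write the low/middle/high thirds of the frequency axis into the
    blue/green/red channels of an RGB image (illustrative split)."""
    n_freq, n_time = spec.shape
    third = n_freq // 3
    norm = (spec - spec.min()) / (np.ptp(spec) + 1e-12)  # scale to [0, 1]
    rgb = np.zeros((n_freq, n_time, 3))
    rgb[:third, :, 2] = norm[:third, :]                    # low band    -> blue
    rgb[third:2 * third, :, 1] = norm[third:2 * third, :]  # middle band -> green
    rgb[2 * third:, :, 0] = norm[2 * third:, :]            # high band   -> red
    return rgb

rgb = spectrogram_to_rgb(np.arange(12.0).reshape(6, 2))
```

Each channel then isolates one frequency band, so per-channel statistics directly reflect the band's amplitude content.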

The algorithm analyzes the color mapping of three "standard" spectrograms in order to find a good representation of the sample [35]. The first picture is the default color picture of the MFCC, ranging from blue (low range) to green (middle range) to red (high range); the grayscale picture is a sequential grayscale mapping from black (low range) to gray (middle range) to white (high range); and finally, the RGB pictures of Stolar et al. [36] are uniform sequential color pictures from blue (d) to green (c) to red (b). They are divided into the images shown in Figure 12.

MFCCs and the Spectrogram RGB Image Feature Method
In the first step, we used the MFCC to extract the frequency spectrum of the audio file of the video clip (such as an advertisement) at a certain sampling rate. The sampling interval, Hamming window width, and sliding size are 30 ms, 4.15 s, and 50 ms, respectively, and the output spectral density is calculated on the dB power scale. The median size of the spectrum is 371 × 501 pixels.
The second step is to balance the frequency spectrum (remove noise). At this stage, the frequency spectrum is balanced, meaning that a pre-emphasis filter is applied to the signal to amplify the high frequencies. The pre-processing filter balances the spectrum because, in terms of ear recognition, high-frequency components have smaller magnitudes than low-frequency components.
The third step is the N-point FFT calculation. NFFT (the number of points of the fast Fourier transform) is 512, and the power spectrum is calculated by the formula given above; Figure 15 shows the result after executing this process. The next step, the calculation of the filter bank, applies a triangular mel-scale filter bank (usually 40 filters, n = 40) to the power spectrum to extract the frequency bands. The mel scale aims to mimic the non-linear recognition of the human ear, which is more discriminative at lower frequencies than at higher frequencies; we can switch between hertz (f) and mel (m) using the formula given above. A discrete cosine transform can then be applied to decorrelate the coefficients of the filter bank and represent the filter bank compactly. The last step is to convert the spectrum to an RGB image. Spectral representations of the three frequency ranges are converted to RGB images to observe the effects of the high-, medium-, and low-frequency characteristics. For example, the high-frequency band (15,000-22,500 Hz) is red, the middle-frequency band (7500-15,000 Hz) is green, and the low-frequency band (0-7500 Hz) is blue. Using the red (R), green (G), or blue (B) components of the RGB images, the importance of the audio components with high, medium, and low amplitude levels, respectively, is calculated as an acoustic characteristic.
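The band-power feature extraction implied by these ranges can be sketched as follows; the 45,000 Hz sampling rate is an assumption chosen so that the Nyquist frequency is 22,500 Hz, and `band_powers` is a hypothetical helper, not the authors' function:

```python
import numpy as np

SR = 45000  # assumed sampling rate, so the Nyquist frequency is 22,500 Hz
BANDS = {  # channel -> (low Hz, high Hz), matching the text
    "blue_low": (0, 7500),
    "green_middle": (7500, 15000),
    "red_high": (15000, 22500),
}

def band_powers(power_spec, sr=SR):
    """Mean power in each band of a one-sided power spectrum.
    Bins on a shared boundary are counted in both adjacent bands."""
    freqs = np.linspace(0, sr / 2, power_spec.shape[-1])
    return {name: float(power_spec[..., (freqs >= lo) & (freqs <= hi)].mean())
            for name, (lo, hi) in BANDS.items()}

features = band_powers(np.ones(257))  # flat spectrum: every band mean is 1.0
```

The three resulting values correspond to the low-, middle-, and high-power attributes used as learning data.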
In the empathy-evaluation method, using image features according to practical examples, the above-mentioned audio features may include the following: low power, middle power, and high power. The above-mentioned video element and audio attributes are used as learning data. The flow chart of this algorithm is shown in Figure 13.

Machine Learning Classifier
We tried many classification methods and analyzed their advantages and disadvantages. The classifiers selected here are the boosting-tree classifier, decision tree classifier, MLP (multilayer perceptron), random forest classifier, KNN (K-nearest neighbor), and SVM (support vector machine) [36][37][38][39][40][41]. We used 10-fold cross validation and standard parameters for each algorithm; the parameters used are specified in Section 3.2.2. The boosting-tree model uses an additive model (a linear combination of basis functions) and a forward stagewise algorithm, with decision trees as the basis functions: a binary classification tree for classification problems and a binary regression tree for regression problems.
Random forest is a meta-estimator, which is suitable for multiple decision tree classifiers on each sub-sample of the data set. It uses the average value to improve prediction accuracy and control overfitting. The sub-sample size is always the same as the original input sample size but if bootstrap is true (the default value) replacement will be used to draw the sample. The performance (function) measures the quality of the split. The supported standards are "gini" for Gini impurity and "entropy" for information gain.
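As an illustrative sketch (not the paper's actual code), the bootstrap sampling and split-quality criteria described above map onto scikit-learn's RandomForestClassifier; the synthetic two-class feature data and the parameter values below are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the audio feature vectors: class 1 is shifted by +2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 6)), rng.normal(2, 1, (100, 6))])
y = np.array([0] * 100 + [1] * 100)

# bootstrap=True (the default): each tree is fit on a sample of the same
# size as the input, drawn with replacement; criterion ("gini" or "entropy")
# measures the quality of each split.
clf = RandomForestClassifier(n_estimators=80, criterion="gini",
                             bootstrap=True, random_state=0)
clf.fit(X, y)
print(round(clf.score(X, y), 3))
```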
Decision tree learning uses a top-down recursive method; its basic idea is to construct a tree in which entropy decreases fastest, using information entropy as the measure. The entropy of each leaf node is zero, and all instances in a leaf node belong to the same class. The most important advantage of the decision tree algorithm is its capacity for self-learning: the user needs little background knowledge and only has to label the training examples well. The key to building the decision tree is determining which attribute to select as the classification basis in the current state. Depending on the objective function, there are three main algorithms for building a decision tree: ID3, C4.5, and CART.
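The entropy criterion above can be made concrete with a small sketch: a pure leaf has entropy 0, a 50/50 split has entropy 1 bit, and ID3 selects the attribute whose split maximizes information gain (the example labels are illustrative):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, the impurity measure used by ID3."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p))) + 0.0  # +0.0 normalizes -0.0

pure = np.array([1, 1, 1, 1])    # entropy 0: all instances in one class
mixed = np.array([0, 0, 1, 1])   # entropy 1 bit: a 50/50 split

# Information gain of a candidate split = parent entropy minus the
# weighted entropy of the children; ID3 picks the attribute maximizing it.
parent = np.array([0, 0, 0, 1, 1, 1])
left, right = np.array([0, 0, 0]), np.array([1, 1, 1])
gain = entropy(parent) - (len(left) * entropy(left)
                          + len(right) * entropy(right)) / len(parent)
print(entropy(pure), entropy(mixed), gain)
```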
The support vector machine (SVM) performs non-linear mapping: it uses an inner-product kernel function in place of an explicit non-linear mapping into a high-dimensional space. The goal of SVM is to find the optimal separating hyperplane of the feature space, and maximizing the classification margin is the core idea of the method. The SVM learning problem can be expressed as a convex optimization problem, so known efficient algorithms can be used to find the global minimum of the objective function. The generalization error rate is low, classification is fast, and the results are easy to interpret. The disadvantage is that the algorithm is sensitive to missing data and to the choice of parameters and kernel function, the "dimensional disaster" has to be avoided, and it is difficult to implement for large-scale training samples.
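A minimal sketch of the margin trade-off on synthetic overlapping classes (the data and both C values are assumptions, not the paper's settings): a large C penalizes slack heavily and fits the training set tightly, while a small C tolerates misclassification for smoother boundaries.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian classes standing in for audio feature vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(1.5, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

# Large C -> tight fit to the training set; small C -> more slack allowed.
tight = SVC(kernel="rbf", C=100.0).fit(X, y)
loose = SVC(kernel="rbf", C=0.1).fit(X, y)
print(round(tight.score(X, y), 3), round(loose.score(X, y), 3))
```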
The KNN algorithm is a simple and robust classification technique. A test feature vector is classified by finding its k nearest neighbor vectors; the distance between each training vector and the test vector is computed with a chosen distance metric [42]. Training is inexpensive, and the method achieves high precision without assumptions about the data distribution. Its disadvantages are the large amount of computation at prediction time, sensitivity to imbalanced samples, and high memory requirements. The KNN classifier is well suited to the spectrogram feature data extracted from the audio.
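The neighbor-voting rule above can be sketched in a few lines of plain NumPy (the toy clusters are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training vectors
    (Euclidean distance), as described above."""
    d = np.linalg.norm(X_train - x, axis=1)        # distance to each training vector
    nearest = y_train[np.argsort(d)[:k]]           # labels of the k closest
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]               # majority vote

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [3.0, 3.0], [3.1, 2.9], [2.9, 3.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.1, 0.1])))  # 0
print(knn_predict(X_train, y_train, np.array([3.0, 2.9])))  # 1
```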

Mel-Frequency Cepstral Coefficients (MFCCs) Result
The assumption we have made is that the extracted audio features can distinguish between empathetic and non-empathetic videos. The first method extracts the audio features in the video, which are the independent variables, along with the sound features shown in Figure 14. The mel-frequency cepstral coefficients (MFCCs) of a signal are a small set of features that concisely describe the overall shape of the spectral envelope. The extracted eigenvalues were subjected to a t-test, which tested the two previous hypotheses with the α level of each test set to 0.05. The results showed that empathy and non-empathy were mainly influenced by the sound and physical elements (coefficients F1, F2, F3, F4, F5, F6, F7, F8, F10) between the two groups (p < 0.001). The differences obtained from the data are shown in Tables 2 and 3.
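A hedged sketch of the group comparison with `scipy.stats.ttest_ind` (the synthetic samples below only reuse the group means and standard deviations reported later; the real per-coefficient values are in Tables 2 and 3):

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for one feature value in the empathetic vs.
# non-empathetic groups (group sizes are assumptions).
rng = np.random.default_rng(42)
empathy = rng.normal(1587.4, 25.8, 40)
non_empathy = rng.normal(2371.9, 29.2, 40)

# Two-sample t-test at alpha = 0.05: a significant difference marks the
# coefficient as a candidate empathy feature.
t, p = ttest_ind(empathy, non_empathy)
print(p < 0.05)  # True: the group means differ significantly
```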

Spectrogram RGB Image Feature Result
The second method of audio extraction extracts spectrogram RGB images of the audio features. The audio characteristics obtained by this method are low power, medium power, high power, low frequency, middle frequency, and high frequency. The results show that empathy or non-empathy in the two groups was mainly influenced by the sound and physical characteristics (high-power mean, medium-power mean, and low-power mean) (p < 0.001). The two differences obtained from the data are shown in Tables 4 and 5.

Table 4. Average error (p < 0.001) of characteristic values of audio and physical elements (low power, medium power, high power, low frequency, middle frequency, and high frequency) compared with empathetic and non-empathetic labels in experiment one.

Finally, the third method is a combination of the first two. Using the MFCC method (whose filter bank mimics the structure of the human cochlea) together with the RGB power spectrum, feature values such as low power, middle power, high power, low frequency, middle frequency, and high frequency were extracted. The results indicated that empathy or non-empathy of the two groups of participants was mainly affected by the sound and physical element features (low-power mean, low-power std, middle-power mean, middle-power std, and high-power std) (p < 0.001). The differences obtained from the data are shown in Tables 6 and 7.

Table 6. Average error of characteristic values of audio and physical element features used for the difference comparison between empathetic/non-empathetic labels in experiment one (low-power mean, low-power std, medium-power mean, medium-power std, and high-power std) (p < 0.001).
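As a sketch of the band-power features named above, a spectrogram can be split into low/middle/high bands and summarized by mean and standard deviation per band (the test tone, sampling rate, and band edges are illustrative assumptions, not the paper's settings):

```python
import numpy as np
from scipy.signal import spectrogram

# Synthetic 1 s signal: a strong 200 Hz tone plus noise stands in for the
# advertisement audio.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) + 0.1 * np.random.default_rng(0).normal(size=fs)

f, _, Sxx = spectrogram(x, fs=fs)  # f: frequency bins, Sxx: power per bin/frame

def band_stats(f_lo, f_hi):
    """Mean and std of spectrogram power within [f_lo, f_hi) Hz."""
    band = Sxx[(f >= f_lo) & (f < f_hi)]
    return band.mean(), band.std()

low = band_stats(0, 500)        # low-power mean/std
mid = band_stats(500, 2000)     # middle-power mean/std
high = band_stats(2000, 4000)   # high-power mean/std
print(low[0] > high[0])  # True: the tone concentrates power in the low band
```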

Table 7. Average error of characteristic values of audio and physical element features used for the difference comparison between empathetic/non-empathetic labels in experiment two (low-power mean, low-power std, medium-power mean, medium-power std, and high-power std) (p < 0.001).

According to the statistical analysis of the eigenvalues obtained by the three methods, in the low-frequency domain it becomes easier to induce human empathy as the audio power increases. The average error of the non-empathy-label group (m = 2371.9, SD = 29.2) was significantly higher than that of the empathy group (m = 1587.4, SD = 25.8) in the system (p < 0.001). As the frequency increases, the ability to induce empathy gradually weakens; comparing the low-power mean and the high-power mean of the empathy label, the average error of the low-power mean (m = 1587.4, SD = 25.8) was significantly higher than that of the high-power mean (m = 1.48, SD = 0.09) (p < 0.001). The features extracted at low power and high power thus differed between the empathetic and non-empathetic labels in experiment one. The results of experiment two show the same trend as experiment one for the empathetic and non-empathetic labels, as shown in Figure 15.

Figure 15. The flow chart of the three methods for empathy evaluation in audio signal processing.

Classification for All Methods Result
In the mixed model of RGB image scale features and the MFCC spectrogram (the RGB scale spectrogram and MFCC spectrogram are given in Table 5), the audio feature values are extracted separately and statistically analyzed. Finally, the meaningful feature values are classified by the machine learning classifier.
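The select-then-classify pipeline above can be sketched as follows (synthetic features stand in for the power-band values; only the significance threshold p < 0.001 comes from the paper, the rest is assumed):

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.ensemble import RandomForestClassifier

# Synthetic feature matrix: 6 features, of which only the first 3 actually
# differ by class (standing in for the "meaningful" power-band features).
rng = np.random.default_rng(3)
X0 = rng.normal(0, 1, (80, 6)); X0[:, :3] += 2.0
X1 = rng.normal(0, 1, (80, 6))
X = np.vstack([X0, X1]); y = np.array([0] * 80 + [1] * 80)

# Keep only features whose between-group t-test is significant (p < 0.001),
# then classify with the random forest, mirroring the pipeline above.
pvals = np.array([ttest_ind(X0[:, j], X1[:, j]).pvalue for j in range(6)])
keep = pvals < 0.001
clf = RandomForestClassifier(n_estimators=80, random_state=0).fit(X[:, keep], y)
print(int(keep.sum()), round(clf.score(X[:, keep], y), 3))
```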

Multi-Conditional Training
The classifiers selected here are the boosting tree classifier, decision tree classifier, MLP (multilayer perceptron), random forest classifier, KNN (K-nearest neighbor), and SVM (support vector machine). The results in Table 8 show that, for the characteristics of the audio signal, the random forest classifier achieves higher accuracy than the other classifiers and therefore better predictive value. The mixed model using the MFCC filter and RGB image processing has the highest accuracy, with 99.173% on the training data and 86.171% on the test data, better than the AdaBoost classifier (78.092% training; 78.125% test), decision tree (83.824% training; 81.196% test), SVM classifier (75.3% training; 62.3% test), KNN classifier (71.2% training; 60.5% test), and MLP classifier (67.823% training; 67.9% test).

Classification and Evaluation
The classifiers selected here are the boosting tree classifier, decision tree classifier, MLP (multilayer perceptron), random forest classifier, KNN (K-nearest neighbor), and SVM (support vector machine) [36][37][38][39][40]. We used 10-fold cross validation and standard parameters for each algorithm; the parameters used are specified in Tables 9-13. AdaBoost uses an additive model to linearly combine weak classifiers: with a weighted majority vote, it increases the weight of classifiers with a small error rate and reduces the weight of classifiers with a large error rate. In this experiment, the maximum training accuracy of 78.092% (test accuracy 78.125%) was reached with a learning rate of 0.8. Each decision tree of a random forest is a classifier (assuming a classification problem), so N trees produce N classification results for an input sample; the random forest aggregates all the classification votes and designates the category with the most votes as the final output. In this experiment, with 80 trees the maximum training accuracy was 99.173%, with an 86.171% test accuracy. The maximum depth a tree is allowed to grow to controls model complexity: the deeper the tree, the more complex the model, and there is a sweet spot between too high and too low. Treat maximum_depth as a hyperparameter and use grid/random search with cross-validation to find a good value. We found that a maximum depth of 10 gives the best training result, with a training accuracy of 83.824% and a test accuracy of 81.196%. For the MLP, the accuracy on the training and test sets also varies with the number of iterations. We iterated from 200 to 500 in steps of 50; once the audio training set reached 350 iterations with the MLP classifier, the accuracy no longer improved noticeably. The training-set and test-set accuracies were maintained at 67.823% and 67.9%, respectively.
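The cross-validated depth search suggested above can be sketched with scikit-learn's GridSearchCV (the synthetic data and the candidate depth grid are assumptions):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the audio feature values.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(1.5, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

# Treat max_depth as a hyperparameter and pick it by 10-fold
# cross-validated grid search.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [2, 4, 6, 8, 10]}, cv=10)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```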
The advantage of the SVM classifier lies in its kernel function; common choices are 'RBF', 'linear', 'poly', etc. As shown in Figure 16, the model works best with the 'RBF' kernel. Another important parameter is C: a larger C penalizes the slack variables more heavily, pushing them toward 0 and increasing the penalty for misclassification, which tends toward fully separating the training set; this is why the generalization ability is weak despite the high training-set accuracy. A small C reduces the penalty for misclassification, tolerates some points as noise, and yields stronger generalization ability [43,44].
In a machine learning model, the parameters that must be selected manually are called hyperparameters. The name 'GridSearchCV' can be split into two parts, 'GridSearch' and 'CV', namely grid search and cross-validation. Grid search searches within the specified parameter range, adjusting the parameters in sequence according to the step length, training the learner with each setting, and selecting the parameters with the highest accuracy on the validation set; it is a process of training and comparison. GridSearchCV is guaranteed to find the most accurate parameters within the specified range, but this is also its flaw: it must traverse all possible parameter combinations, which is very time consuming for large data sets and many parameters. In this analysis, SVM is used for classification and evaluation, and to establish the evaluation model. First, the GridSearchCV class is used to select the optimal C and gamma of the SVM, and the selected C and gamma are used for the svm.SVC model [44]. The results of the SVM model evaluation are shown in Table 6. From the results of NuSVC, SVC, and LinearSVC in the SVM model, SVC is the best choice. The kernel function of svm.SVC (gamma: 1; learning rate: 0.001) is more suitable for audio processing when using the SVM classifier, as shown in Table 6.
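The C/gamma selection described above can be sketched as follows (the synthetic data and the candidate grids are illustrative, not the paper's actual ranges):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-class data standing in for the audio feature values.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (80, 3)), rng.normal(1.5, 1, (80, 3))])
y = np.array([0] * 80 + [1] * 80)

# Exhaustive cross-validated search over C and gamma for the RBF kernel.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```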

Conclusions
In this paper, 24 advertisement videos from YouTube and 8 drama video clips were selected as stimuli. The subjects watched the videos and made a subjective assessment of their empathy levels. The audio signal in the advertisement was the main object of analysis, with three analysis methods used: MFCC, RGB spectrogram, and the mixed model of the MFCC-filtered RGB spectrogram. Under a statistical t-test, the audio feature values differed between the empathy and non-empathy labels. We then used six machine learning classifiers with high recognition performance to learn and classify the collected feature values, and determined which achieved the highest accuracy: the boosting tree classifier, decision tree classifier, MLP (multilayer perceptron), random forest classifier, KNN, and SVM. The random forest classifier achieved a training accuracy of 99.173% and a test accuracy of 86.171%. Among the three feature extraction methods, the hybrid model of the RGB spectrogram filtered by MFCC has the highest training and test accuracy.
The statistical comparison of power in the low-frequency domain shows that, as the audio power increases, the audio is more likely to evoke empathetic resonance, and as the frequency increases, the ability to induce empathy weakens. In a word, the contributions of this study are as follows. First, for those who make commercial videos, music with low frequency and high amplitude is more likely to resonate with viewers than audio with high frequency and high amplitude. Secondly, by observing the features used by the machine learning classifier, it can be found that frequency and audio amplitude are important attributes for describing waveforms. Thirdly, a new audio extraction method is proposed to induce human empathy: the feature values extracted by the method of spectrogram image features of audio have the greatest ability to arouse human empathy.

Discussion
The conclusion is that increasing the strength of the audio in advertising production enhances viewers' empathy in this system. The experiments presented here demonstrate clear advantages of the proposed methodology compared with traditional methods that use arrays of spectrogram magnitudes [45][46][47]. This agrees with Andrei C. Miu, who found that positive emotions trigger empathy more than negative emotions [48], and with Jing Z. et al., who found that the power of positive-emotion audio is higher than that of negative-emotion audio [23]. It also agrees with Yoo, S. et al., who argued that the effect of a co-viewer may only impact empathetic responses when participants feel higher emotional intensity [32]. Therefore, it can be concluded that the ability to induce human empathy increases with audio power in the low-frequency domain: for the low-power mean, the average number of errors is significantly higher in the non-empathetic-label group than in the empathetic-label group. As the frequency increases, the ability to induce empathy gradually weakens: comparing the low-power and high-power means of the empathetic label, the average number of errors was significantly higher in the non-empathetic-label group than in the empathetic-label group.
In the future, the image feature value and the audio feature value in the advertisement would be combined in order to achieve a greater effect on the empathy stimulation of the advertisement.

Institutional Review Board Statement:
The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of Sangmyung University, Seoul, Korea (BE2018-35).

Informed Consent Statement:
Written informed consent to publish this paper was obtained from all subjects involved in the study.