Sentiment Analysis on Online Videos by Time-Sync Comments

Video highlights are welcomed by audiences and are composed of interesting or meaningful shots, such as funny shots. However, highlights are currently edited manually by video editors, which is inconvenient and consumes an enormous amount of time. A way to help video editors locate video highlights more efficiently is therefore essential. Since interesting or meaningful highlights in videos usually imply strong sentiments, a sentiment analysis model is proposed to automatically recognize the sentiments of video highlights from time-sync comments. As the comments are synchronized with video playback time, the model detects sentiment information in the time series of user comments. Moreover, in the model, a sentimental intensity calculation method is designed to compute the sentiments of shots quantitatively. The experiments show that our approach improves the F1 score by 12.8% and the overlapped number by 8.0% compared with the best existing method in extracting sentiment highlights and obtaining sentimental intensities, which assists video editors in editing video highlights efficiently.


Introduction
With the boom of online video websites, more and more people watch videos online. Those websites not only bring convenience in watching videos but also provide functions for people to comment on videos. However, since a huge number of videos are uploaded to the websites every day, it is hard for one to watch every minute of them. In this circumstance, audiences may prefer to watch video highlights, which are composed of excellent video fragments, instead of watching entire videos.
Video highlights are a crucial aspect of video content as they provide audiences with a condensed version of the most interesting and meaningful parts of the video. However, the process of manually editing these highlights is time-consuming and labor-intensive, making it essential to find a more efficient way to locate the video highlights. In recent years, sentiment analysis has emerged as a promising approach for automatically recognizing the sentiments of video highlights using time-sync comments.
Time-sync comments (TSCs) are messages that users send while watching a video to express their thoughts and feelings about what they are seeing. These comments appear on the screen at the moment they are made and reflect the users' mood during that particular segment of the video. By analyzing the time-sync comments, we can gain insights into the emotions of the viewers and even predict the emotional trajectory of the video. In this paper, we mainly conduct experiments on Chinese time-sync comments. These comments are often used to express various emotions and moods, ranging from happiness and excitement to sadness and frustration. For example, viewers may leave comments like "OMG" or "lol" to express their amusement or laughter, while comments such as "so sad" or "heartbreaking" can indicate a feeling of sadness or sympathy.
By analyzing the sentiment of time-sync comments, we can detect sentiment information in the time series of comments and use this information to extract the most interesting or meaningful parts of the video. Furthermore, we can quantify the sentimental intensity of these shots using a sentimental intensity calculation method.
In this paper, we propose a TSC-based sentiment analysis model to extract highlights from videos and calculate their sentiment intensity. The main contributions include: (1) a sentiment fragments detection model for videos using TSC data is proposed to detect video fragments with strong sentiment from videos, (2) a highlight extraction strategy is designed to find video highlights, and (3) a sentiment intensity calculation method for video fragments is constructed in order to compute sentiments of video fragments quantitatively.
The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 defines two problems of sentiment analysis on online videos. Two sentiment analysis strategies using TSC are proposed in Sections 4 and 5. Section 6 evaluates the performance of the model using a TSC dataset. We conclude our work in Section 7.

Time-Sync Comments
Time-sync comments, first introduced in academia [1], are widely used on video websites, such as Acfun, Bilibili, and YouKu, which are among the most popular video websites in China. One TSC is composed of a comment and a time stamp. It is a comment sent by an audience member that shows that viewer's opinion on a video shot.
The time stamp is synchronized to the shot's playback time in the video [2]. TSCs have been used for video classification tasks [3]. Researchers also use TSCs to extract video highlights [4][5][6]. Moreover, recent approaches are beginning to apply TSCs to the emotional analysis of videos [7,8]. Bonifazi et al. [9] take into account the similarity between patterns and put forth a content semantic network called CS-Net to handle reviews. To measure the similarity between two networks, they calculated the similarity of structural features across different networks. As the TSCs of a video indicate audiences' opinions on the shots of the video, text analysis of the TSCs is able to extract details for every single shot of a video. Moreover, the extraction results reflect not only explicit information but also implicit information.

Video Highlight Extraction
The work of video highlight extraction is mainly carried out manually by editors of online video websites. In order to extract highlights from videos, those editors have to watch the whole video first. Then, they select video fragments that are interesting and may be welcomed by audiences. Lastly, the video fragments are re-edited and re-organized as video highlights. As such work is inefficient, it is necessary to provide a method that can extract interesting video fragments automatically. Recently, some researchers have begun to use TSCs for video highlight extraction. One work proposes "global + local" sentiment analysis to find highlights [5]. Another work proposes lag-calibration and the combination of topic and emotion concentration in an unsupervised way to detect highlights [6]. Actually, in a video, fragments that are welcomed by audiences always strongly indicate one or more sentiments. Therefore, sentiment detection for video fragments is the key process in extracting the fragments that audiences welcome.

Sentiment Analysis
Many researchers have focused on detecting sentiment using image-based approaches. A number of researchers track the human face [10][11][12][13] or human pose [14][15][16][17][18], while some other researchers extract semantic features of sentiment from images [19][20][21][22][23][24]. However, compared with text-based processes, image-based approaches consume more time and cost more computational resources, but achieve less accuracy [25]. Additionally, labels extracted by the image-based approaches can only reflect explicit sentiments [26]. By contrast, both explicit and implicit sentiments can be detected by analyzing audience comments using the text-based approaches.
As the textual approaches have those advantages, many efforts have been directed to text-based analysis [27][28][29][30][31][32][33][34][35][36]. Nevertheless, current approaches either assign sentiment tags to whole videos instead of a single shot [37] or treat the video shots as independent objects [38], while a video segment constitutes a group of the shots that may have relations with preceding and following shots. Bonifazi et al. [39] propose a general framework capable of analyzing the range of sentiment associated with any topic on any social network.
In conclusion, while researchers primarily focus on tasks such as video classification and video clip recommendation using TSCs, they often overlook the potential of using TSCs for video highlight extraction and for calculating the sentimental intensity of those highlights. Therefore, we propose a four-step strategy for extracting sentiment highlights in videos, which involves identifying and grouping together adjacent video fragments that share similar sentiment. Moreover, we introduce a strategy for quantitatively measuring the sentimental intensity of a highlight, taking into account not only the types of sentiment implied but also the strength of the sentiment within each type. By employing these strategies, we aim to enhance the understanding and representation of content with various sentiments within videos.

Illustration of Time-Sync Comments
A time-sync comment is composed of a text-based comment and a time stamp. The comment is usually a sentence of fewer than 20 words. Sometimes it is a text symbol representing an emotion, such as OMG standing for surprise, LOL meaning happiness, and 233333 expressing a laugh, a convention among TSC users. The time stamp records the playback time of a video shot, and it is synchronized to the comments on the shot. Figure 1 shows an example of two shots and their TSCs in the video Forrest Gump. In the figure, Is she Jenny?! and She is beautiful are two TSCs on the shot whose playback time is 13:43, and He was shot and It's so affecting are another two TSCs that are synchronized to the time stamp 54:13.
The sentiment features of a video shot are indicated by TSCs. For example, She is beautiful reflects that the sentiment of the current shot is close to LIKE rather than HATE. In addition, It's so affecting means that the fragment close to the playback time 54:13 contains a positive sentiment instead of a negative one.

Formal Definition
Let v be a video. Let T_start and T_end be the start time and finish time of v, respectively. Let T_v be the length of v. The video is divided into a sequence of fragments F_v = {f_v,1, f_v,2, ..., f_v,N_F}, where f_v,i is the i-th fragment and N_F is the number of fragments. Each fragment has the same length, T_f. We use T_start,i and T_end,i to represent the start time and finish time of f_v,i. We define that, for any i (1 ≤ i < N_F), there is an interval I (I < T_f) between the start times of f_v,i and f_v,i+1, that is, I = T_start,i+1 − T_start,i. It means every two adjacent fragments have a (T_f − I)-length overlap. Thus, T_v = I × (N_F − 1) + T_f. Usually, T_f is far less than T_v, and I is less than T_f. Therefore, the number of fragments in v is approximately T_v / I. Suppose T_f is small enough that one fragment cannot display a complete highlight; that is, a fragment is only a part of a highlight. In other words, a highlight consists of more than one continuous fragment when T_f is small. Let H_v = {h_v,1, h_v,2, ..., h_v,N_H} be the set of highlights in v, where N_H is the number of highlights. Suppose there are k types of sentiments. Let S = {s_1, s_2, ..., s_k} be the set of sentiments. The sentiment intensity of a highlight, h_v,i, is defined as E_d,h_v,i = (e_1, e_2, ..., e_k). It is a vector that shows the intensity distribution over the k types of sentiments for the highlight h_v,i. For any e_j (1 ≤ j ≤ k), it is the intensity value of sentiment type s_j in h_v,i.
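The fragment layout above can be checked with a small sketch; the helper function and the 110 s example below are illustrative, not part of the model:

```python
# A sketch of the fragment layout, assuming illustrative values for
# T_f (fragment length) and I (start interval).
def fragment_bounds(t_v, t_f, interval):
    """Return (start, end) times of the overlapping fragments of a video.

    Adjacent fragments start `interval` seconds apart, so every two
    adjacent fragments share a (t_f - interval)-second overlap.
    """
    assert interval < t_f, "adjacent fragments must overlap"
    bounds, start = [], 0.0
    while start + t_f <= t_v:
        bounds.append((start, start + t_f))
        start += interval
    return bounds

# T_v = I * (N_F - 1) + T_f: a 110 s video with T_f = 30 s and I = 20 s
# yields N_F = (110 - 30) / 20 + 1 = 5 fragments.
frags = fragment_bounds(110, 30, 20)
```

The example reproduces the counting formula: five fragments starting 20 s apart, each overlapping its neighbor by 10 s.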
Let B_v be the set of TSCs in v, and B_f_v,i be the set of TSCs in f_v,i. A TSC, b, is a tuple (w_b, t_b, u_b), where w_b is a set of words or text symbols, t_b is b's time stamp, and u_b represents the user ID of the audience member who sends b. Let N_U be the total number of audiences who send comments to v. Let T_sync(w) be the time stamp that is synchronized to a comment w, and user(w) be the user who sends w. In the case of the tuple (w_b, t_b, u_b), T_sync(w_b) = t_b and user(w_b) = u_b. The notations defined are listed in Table 1.
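The tuple notation can be mirrored in a small data structure; the field names and sample values below are illustrative, not from the paper:

```python
# A minimal representation of a TSC b = (w_b, t_b, u_b).
from dataclasses import dataclass

@dataclass(frozen=True)
class TSC:
    words: tuple   # w_b: the words or text symbols of the comment
    t: float       # t_b: the synchronized playback time stamp (seconds)
    user: str      # u_b: ID of the audience member who sent the comment

def tscs_in_fragment(tscs, t_start, t_end):
    """B_{f_v,i}: the TSCs whose time stamps fall inside one fragment."""
    return [b for b in tscs if t_start <= b.t < t_end]
```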

Problem Statement
Under the formal description, the problems of sentiment highlight extraction and sentiment intensity calculation are defined. The two problems are described as follows.
(1) Problem of Sentiment Highlight Extraction: Given v and B_v, for any 1 ≤ i ≤ N_H, find l_i and r_i, the indices of the first and last fragments of highlight h_v,i, that satisfy the constraint conditions below. (2) Problem of Sentiment Intensity Calculation: Given H_v, B_v, and S, for any 1 ≤ i ≤ N_H, find a vector (e_1, e_2, ..., e_k) that shows the intensity distribution over (s_1, s_2, ..., s_k), where e_j is the value of intensity in s_j and s_j ∈ S.
As fragments in the same highlight reflect similar sentiment, the problem of highlight extraction is how to gather fragments that have similar sentiment together. If the problem is solved, we can obtain a set of highlights, H_v. After obtaining the set of highlights, H_v, the sentiment intensity of each highlight in H_v can be computed by solving the problem of highlight sentiment intensity calculation using the TSC set.

Sentiment Highlight Extraction
A strategy of sentiment highlight extraction is used to extract highlights in a video by gathering adjacent video fragments that have similar sentiment together. It is mainly composed of four steps: (1) TSC vectors of all fragments are constructed, (2) similarity matrices of all fragments are generated to measure similarities among user comments, (3) the feature similarity of each fragment is calculated, and (4) the highlight score of each fragment is calculated. The processes of the strategy are shown in Figure 2. The details of the four steps in Figure 2 are described in the following four subsections.

Construct TSC Vectors
We construct a TSC vector for each fragment f_v,i from the TSCs whose time stamps fall between T_start,i and T_end,i, where T_start,i and T_end,i are the start time and finish time of fragment f_v,i, respectively.
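A minimal sketch of this step follows; a bag-of-words count vector stands in for the LDA/BERT representations used later in the paper, and the vocabulary and comments are illustrative:

```python
# Step (1) sketch: one vector per fragment, built from its TSC text
# over a fixed vocabulary.
from collections import Counter

def tsc_vector(comments, vocabulary):
    """Count-vector of a fragment's TSCs over a fixed vocabulary."""
    counts = Counter(w for c in comments for w in c.split())
    return [counts[w] for w in vocabulary]
```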

Generate Similarity Matrices
A similarity matrix is generated for each fragment. It reflects the similarities of comments from different users on the same fragment. A similarity matrix, M_f_v,i, is generated for fragment f_v,i. Let m_j,k be the element at the j-th row and k-th column in M_f_v,i. Then, m_j,k = sim(w_j, w_k), where sim is a similarity factor such as cosine similarity, and w_j and w_k are the comment vectors of the j-th and k-th users in the fragment.
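Assuming each user's comments on a fragment have already been turned into vectors, the matrix can be sketched with cosine similarity as the similarity factor:

```python
# Step (2) sketch: M[j][k] is the cosine similarity between the comment
# vectors of users j and k on the same fragment.
import math

def similarity_matrix(user_vectors):
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    return [[cos(a, b) for b in user_vectors] for a in user_vectors]
```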

Calculate Feature Similarity
After obtaining the similarity matrix M_f_v,i for fragment f_v,i, we easily obtain M_f_v,i's largest real eigenvalue and its corresponding eigenvector, p_i. The Perron-Frobenius theorem ensures that the components of p_i are positive values. The values in p_i are thought of as features of the "sentiment" implied by audiences' comments on fragment f_v,i.
Since p_i represents the features of f_v,i, we calculate the mean value of the features of the nearest m fragments before f_v,i. The mean value, p_i,mean, is calculated by Equation (1): p_i,mean = (1/m) × Σ_{j=i−m}^{i−1} p_j. (1)
The feature similarity of fragment f_v,i, notated as S_f_v,i, is the similarity of p_i and p_i,mean. The similarity is calculated using the cosine function: S_f_v,i = cos(p_i, p_i,mean).
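These two computations can be sketched without a linear algebra library by approximating the Perron eigenvector with power iteration; the matrix values below are illustrative:

```python
# Step (3) sketch: the principal (Perron) eigenvector of a fragment's
# similarity matrix as its sentiment feature, and the cosine similarity
# against the mean feature of the previous m fragments.
def principal_eigenvector(m, iters=100):
    """Approximate the eigenvector of the largest eigenvalue of a
    nonnegative symmetric matrix by power iteration."""
    n = len(m)
    p = [1.0] * n
    for _ in range(iters):
        q = [sum(m[i][j] * p[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in q) ** 0.5
        p = [x / norm for x in q]
    return p

def feature_similarity(p_i, previous):
    """S_{f_v,i} = cos(p_i, p_{i,mean}) over the previous m fragments."""
    n = len(p_i)
    p_mean = [sum(v[d] for v in previous) / len(previous) for d in range(n)]
    dot = sum(a * b for a, b in zip(p_i, p_mean))
    na = sum(a * a for a in p_i) ** 0.5
    nb = sum(b * b for b in p_mean) ** 0.5
    return dot / (na * nb)
```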

Finding Video Highlights
Firstly, the highlight scores of all fragments are calculated in order to decide which fragments are put together in the same highlight. R_f_v,n, the highlight score of fragment f_v,n, is calculated by Equation (2), where D_f_v,n is the TSC density in f_v,n, defined as the number of all TSCs commented on f_v,n.
The larger TSC density a fragment has, the stronger sentiment the fragment manifests. It is attributed to the fact that people prefer to express their opinions when they feel a fragment is interesting or meaningful, which makes the number of TSCs increase.
Next, fragments that have high highlight scores are selected as single highlights. The highlight score of a fragment indicates the possibility that the fragment is considered a highlight. The higher a fragment's highlight score, the higher the probability that the fragment may become a highlight.
A highlight threshold, δ, is set for single highlight detection. If R_f_v,i, the highlight score of fragment f_v,i, is larger than the highlight threshold, δ, then f_v,i is selected as a single highlight.
After that, relevant single highlights are merged into one highlight. Any two adjacent fragments, f_v,i and f_v,j, will be merged into a highlight if (a) both of their highlight scores are larger than δ, the highlight threshold, and (b) their feature similarity is larger than θ, a link threshold for deciding whether two fragments have strong relevance in sentiment.
Under the strategy, a fragment will be merged with its neighboring fragment if the two fragments are relevant in sentiment and both of them are single highlights. Moreover, three or more adjacent fragments can be merged as a highlight.
Lastly, a highlight set, H_v, is obtained by putting all the highlights together.
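The selection and merging steps above can be sketched as follows; the score and similarity values are illustrative, and `sims[i]` stands for the sentiment relevance between fragments i and i + 1:

```python
# Step (4) sketch: pick fragments whose highlight score exceeds delta and
# merge adjacent, sentiment-relevant single highlights into one highlight.
def merge_highlights(scores, sims, delta, theta):
    """Return highlights as (first_fragment, last_fragment) index pairs."""
    highlights, i = [], 0
    while i < len(scores):
        if scores[i] > delta:
            j = i
            # absorb neighbors that are also single highlights and relevant
            while (j + 1 < len(scores) and scores[j + 1] > delta
                   and sims[j] > theta):
                j += 1
            highlights.append((i, j))
            i = j + 1
        else:
            i += 1
    return highlights
```

With scores [0.1, 0.5, 0.6, 0.1, 0.7], fragments 1 and 2 merge into one highlight while fragment 4 stays a single highlight, matching the merging rule described above.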

Sentiment Intensity Calculation
A strategy of sentimental intensity calculation is used to measure the strength of sentiment for a highlight quantitatively. It reflects not only which sentiment types the highlight implies, but also how strong the highlight's sentiment is in each type. In this paper, we choose TSCs in the Chinese language to analyze sentiment intensity because Chinese is the most popular language in TSCs. For TSCs in other languages, the sentiment intensity can still be calculated in the same way using the grammar rules of those languages and conventional sentiment analysis methods such as Bidirectional Encoder Representations from Transformers (BERT).

Word Groups Division for TSCs
Using the strategy of sentiment highlight extraction, a set of highlights, H_v, is extracted from video v. A highlight h_v,i ∈ H_v is composed of one or more adjacent fragments.
Let CMT_h_v,i be the set of TSC comments that are commented in the fragments of h_v,i; that is, CMT_h_v,i is the union of the TSC sets of those fragments. Through linguistic analysis, the sentiments implied in a sentence are impacted by some special words in the sentence. In the case of TSCs, there are three categories of special words: emotional words, adverbs, and negative words. An emotional word in a comment expresses some kind of sentiment and its intensity. An adverb strengthens or weakens the sentiment intensity of a comment. A negative word changes the meaning of a comment completely. For example, both the sentences I am a little bit happy and I am very happy express the sentiment of HAPPY, but the sentiment of the second sentence is much stronger than that of the first one. It is attributed to the fact that very is an adverb whose weight is much greater than a little bit. As another example, I am happy shows the sentiment of HAPPY, while I am not happy describes a sentiment opposite to HAPPY, i.e., probably SAD.
Emotional words in CMT_h_v,i can be selected according to a dictionary of emotional words. The sentiment intensities of the emotional words can also be obtained from the dictionary. Actually, for an emotional word, d_j, its sentiment intensity, E_d,d_j = (e_1, e_2, ..., e_k), is a distribution of sentiment strengths over the k types of sentiments, and e_j (1 ≤ j ≤ k) is the strength of d_j on the j-th sentiment type.
Most words in TSCs can be covered by the dictionary. However, there are some new terms that are not included in the dictionary. For those emotional terms that exist in CMT_h_v,i but are not found in the dictionary, we extend the dictionary by setting a sentiment type and a value of sentiment intensity. There are two available approaches to extend the sentiment dictionary. One method uses a dictionary of synonyms in which new terms are synonymous with existing ones. We replace new terms with terms from the existing sentiment dictionary, thus obtaining a similar sentiment intensity. Another approach uses the original sentiment dictionary as a foundation and calculates the semantic similarity between new terms and the terms in the sentiment dictionary. It allows for the extension of the sentiment dictionary based on the semantic associations between terms. As the dictionary extension approaches are beyond the scope of this paper, their details are not introduced here.
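The synonym-based extension can be sketched as follows; the synonym table and intensity vectors are illustrative, not taken from the dictionaries used in the paper:

```python
# A sketch of extending the sentiment dictionary via synonyms: a new term
# inherits the intensity vector of its first synonym that is already known.
def extend_dictionary(sent_dict, synonyms, new_term):
    for candidate in synonyms.get(new_term, []):
        if candidate in sent_dict:
            return {**sent_dict, new_term: sent_dict[candidate]}
    return sent_dict  # no known synonym: leave the dictionary unchanged
```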
Just as the sentiment intensity, E_d,d_j, of an emotional word, d_j, can be obtained from the dictionary of emotional words, a weight, W_D, for an adverb, D, can be obtained from a dictionary of adverbs. Similarly, negative words in CMT_h_v,i can easily be found using a dictionary of negative words.
Suppose there are N_D,i emotional words in CMT_h_v,i, and the words of the comments in CMT_h_v,i are organized into N_D,i groups {G_1, G_2, ..., G_N_D,i}. Each emotional word with its related adverbs and negative words is put into the same group. Thus, every group contains only one emotional word and may include one or more adverbs and negative words. Figure 3 shows groups of TSC words.
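A sketch of the division follows, assuming each emotional word collects the adverbs and negative words that precede it; the tiny English word lists stand in for the Chinese dictionaries:

```python
# Word-group division sketch: one group per emotional word, carrying the
# adverbs and negative words seen since the previous emotional word.
EMOTION = {"happy", "sad"}
ADVERB = {"very", "slightly"}
NEGATIVE = {"not"}

def divide_groups(tokens):
    groups, modifiers = [], []
    for w in tokens:
        if w in ADVERB or w in NEGATIVE:
            modifiers.append(w)
        elif w in EMOTION:
            groups.append({"emotion": w, "modifiers": modifiers})
            modifiers = []
    return groups
```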

Sentiment Intensity Calculation for Highlights
According to the definition in Section 3, E_d,h_v,i = (e_1, e_2, ..., e_k) is the sentiment intensity of highlight h_v,i, where k is the number of sentiment types, and e_j (1 ≤ j ≤ k) is the intensity value of the j-th sentiment type in h_v,i.
The sentiment intensity of G_j (1 ≤ j ≤ N_D,i) is affected by the adverbs and negative words in G_j. It is calculated in four situations:
(a) There is neither an adverb nor a negative word in G_j. The sentiment intensity of G_j is the same as that of its emotional word, d_j: E_d,G_j = E_d,d_j, where E_d,d_j is the sentiment intensity of emotional word d_j.
(b) There is no adverb but there are N_n (N_n ≥ 1) negative words in G_j. Since a negative word oppositely affects an emotional word, in Chinese grammar, an even number of negative words indicates a stronger positive meaning, while an odd number of negative words indicates a stronger negative meaning. Therefore, according to the number of negative words, the sentiment intensity of G_j is calculated as E_d,G_j = (−1)^N_n × E_d,d_j.
(c) There is no negative word but there is one adverb, D, in G_j. The sentiment intensity of G_j is calculated as E_d,G_j = W_D × E_d,d_j, where W_D is the weight of adverb D.
(d) There are both an adverb and N_n (N_n ≥ 1) negative words in G_j. As the comments in CMT_h_v,i are Chinese characters, according to Chinese linguistic features, a word group with more than one adverb is not grammatical, so we consider that there is at most one adverb in G_j. Meanwhile, an adverb written before or after a negative word affects the sentiment intensity of a word group differently. If the adverb is before all negative words in G_j, the sentiment intensity of G_j is calculated as E_d,G_j = (−1)^N_n × W_D × E_d,d_j.
If there are N_n1 (1 ≤ N_n1 ≤ N_n) negative words before D and N_n2 (N_n2 = N_n − N_n1) negative words after D, the sentiment intensity of G_j is calculated as E_d,G_j = (−1)^N_n × W × W_D × E_d,d_j, where W is a parameter to weaken the sentiment intensity, W_D is the weight of adverb D, and E_d,d_j is the sentiment intensity of emotional word d_j.
From the processes above, we can obtain the sentiment intensity of each word group, G_j, in CMT_h_v,i. Then, the sentiment intensities of all word groups are used to generate the sentiment intensity of the video highlight. The sentiment intensity of highlight h_v,i is calculated as E_d,h_v,i = (Σ_{j=1}^{N_D,i} E_d,G_j) / (T_end,s_i+N_i−1 − T_start,s_i), where E_d,G_j is the sentiment intensity of the j-th word group in CMT_h_v,i, N_D,i is the number of word groups in CMT_h_v,i, T_start,s_i and T_end,s_i+N_i−1 are the start point and end point of video highlight h_v,i, respectively, and I is the interval between T_start,s_i and T_start,s_i+1.
The sentiment intensity, E_d,h_v,i, is the average value of the total sentiment intensity in the highlight, h_v,i, per unit time.
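The group rules and the per-unit-time average can be sketched as follows; the sign flip per negative word, the multiplicative adverb weight, and the weakening factor W follow the descriptions above, and the numeric values are illustrative:

```python
# Group intensity: the sign flips with the number of negative words, the
# adverb weight scales the intensity, and W weakens it when negative words
# surround the adverb; highlight intensity is the per-unit-time average.
def group_intensity(e_d, n_neg=0, w_adv=1.0, split_negation=False, w=0.5):
    sign = (-1) ** n_neg
    weaken = w if split_negation else 1.0
    return [sign * w_adv * weaken * e for e in e_d]

def highlight_intensity(group_intensities, duration):
    dims = len(group_intensities[0])
    total = [sum(g[d] for g in group_intensities) for d in range(dims)]
    return [t / duration for t in total]
```

For instance, "not very happy" maps to one negative word and an adverb weight on the HAPPY dimension, flipping and scaling its intensity before the groups are averaged over the highlight's duration.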

Experiment Setup
A TSC dataset that includes approximately 16 million TSCs is used to evaluate the performance of our proposed work. The TSCs are collected from 4841 online videos, which contain movies, animation, TV series, and variety shows.
Emotional words ontology (http://ir.dlut.edu.cn/info/1013/1142.htm (accessed on 1 May 2023)), provided by the Dalian University of Technology, is used to build up our sentiment dictionary. In the dictionary, each word is related to a sentiment intensity, a 7-dimensional vector. Each dimension represents one of seven kinds of sentiment: happy, good, angry, sad, afraid, hate, and shock.
We randomly selected 34 movies on the Bilibili website, including action movies, comedy movies, fantasy movies, horror movies, etc. The TSCs of movies including Spider-Man: Homecoming, Harry Potter and the Philosopher's Stone, Green Book, Charlie Chaplin, The Shawshank Redemption, Secret Superstar, etc. were chosen from the dataset for our experiments. In the experiments, the fragment length T_f is set to 30 s and the fragment interval I is set to 20 s. Different movies have different numbers of time-sync comments; we randomly selected 5000 time-sync comments for each movie. We combined the movie categories on the IMDb website with the sentiment analysis of all the time-sync comments of the movies to classify the selected movies in the experiments. The basic information of the movies is shown in Table 2.
There are some highlights in each movie. All of the baseline highlights are manually selected by movie audiences. We obtained the edited highlight videos from the IMDb and Bilibili websites and matched them with the original movies to obtain the highlight times. The baseline highlights of some movies in the dataset are listed in Table 3. We chose one movie from each category; the movie name, highlight number, and highlight playback time can be found in Table 3. In the experiments, we used two metrics to measure the performance of the sentiment highlight extraction strategy: (1) the sentiment highlight F1 score, the harmonic mean of the precision and recall of the extracted highlights against the baseline highlights; and (2) the overlapped number count, which is the number of overlapped fragments between the highlights extracted by our proposed approach and the baseline highlights.
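Assuming highlights are represented as sets of fragment indices, the two metrics can be sketched as:

```python
# Evaluation sketch: fragment-level F1 against the baseline highlights,
# and the overlapped number (shared fragment count).
def fragment_f1(extracted, baseline):
    overlap = len(extracted & baseline)
    if overlap == 0:
        return 0.0
    precision = overlap / len(extracted)
    recall = overlap / len(baseline)
    return 2 * precision * recall / (precision + recall)

def overlapped_number(extracted, baseline):
    return len(extracted & baseline)
```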

Evaluation of Sentiment Highlights Extraction
In the experiments, the highlight threshold, δ ∈ [0, 1), and the linking threshold, θ ∈ [0, 1), are two adjustable parameters. A number of experiments show that θ has little effect on sentiment highlight extraction. Therefore, θ is set to 0.1 in the experiments. In order to obtain the optimal value of δ, we calculated the average F1 score and overlapped number count under different δ. We used Latent Dirichlet Allocation (LDA) and BERT, respectively, to construct TSC vectors in our method. The main parameters in the LDA model are as follows: the number of topic sampling iterations η = 100 and the number of hidden topics K = 100. The main parameters in the BERT model are as follows: the number of hidden layers is 12, the hidden size is 768, and the number of attention heads is 12. We compared our method with three methods: (1) randomly selected fragments, (2) Multi-Topic Emotion Recognition (MTER) [5], and (3) the method proposed by Ping [6]. We also compared variants of our method using different ways of constructing TSC vectors, as well as a variant without the highlight-finding step. The overlapped number count is the sum of the overlapped numbers of these methods. Figure 4 shows the experimental results of the sentiment highlight extraction strategy. As we can see in the figure, our proposed strategy has the highest average F1 score and the highest overlapped number count when δ = 0.2; in other words, our model has the optimal extraction effect at δ = 0.2. Therefore, we set δ = 0.2 and θ = 0.1 in the following experiments. Table 4 shows the sentiment highlight F1 scores for these sentiment highlight extraction methods. The optimal value of each row is shown in bold.
From the experimental results, it can be seen that, for different categories of movies, our method performs better on comedies and dramas, because the highlights of these movies are more concentrated, while for action, horror, and thriller movies the results are lower. On one hand, the sentiment type is relatively simple and uniform in comedy movies. Audiences have the same feeling when they watch happy clips, and there is agreement in how the clips are understood. Therefore, the happy clips can be easily extracted as highlights, which makes the F1 score of the comedy genre higher than that of other genres. On the other hand, there is a greater variety of scenes in other genres, such as fighting in action movies and jump scares in horror and thriller movies. The sentiment types are varied and complex in those genres, so different audiences have different understandings even when they watch the same scenes. Therefore, movies of those genres achieve a lower F1 score compared with comedy movies. From the experimental results, we can also see that our method with BERT has a higher F1 score than the other methods for action-adventure, comedy, fantasy, crime, and drama movies. This shows that our method generalizes better across different categories of movies. However, for horror and thriller movies, the results of our method are slightly worse than those of the method proposed by Ping [6]. We speculate that this may be because audiences are tense when watching horror and thriller movies, so their comments have a larger latency. Meanwhile, the F1 scores of our method are all greater than 0.5, while the method proposed by Ping performs poorly on some movies, such as Slumdog Millionaire. Therefore, our method performs more stably than the method proposed by Ping.
In summary, the experimental results indicate that our proposed method outperforms MTER [5] and the method of Ping [6], with higher overall accuracy and F1 score. They also show that our method produces good results for different categories of movies and thus has a certain universality.
As can be seen from Table 5, the average overlapped number of our method with BERT is better than that of the other methods. We randomly selected one movie of each type from Table 2; the overlapped numbers for these movies are visually displayed in Figure 5. Figure 5 demonstrates that the overlapped number of our method with BERT is higher than that of the other methods for most movies. The results of our experiments show that different types of movies yield different results. Specifically, we found that the movie genre affects the emotional response of audiences, which in turn impacts the use of emotional words in the TSCs. For instance, the overlapped numbers for Charlie Chaplin and Secret Superstar are high, while those for Pacific Rim are low. Charlie Chaplin is a comedy movie with a relaxing and cheerful emotional tone, which increases the probability of audiences using straightforward emotional words such as "2333", "funny", and "interesting". Similarly, in Secret Superstar, a movie with a profound conceptual theme, the plot twist can elicit a strong emotional response from audiences, leading them to express straightforward emotional words more frequently.
In contrast, for action movies such as Pacific Rim, audiences tend to pay more attention to fight scenes and special effects rather than the emotional content of the movie, resulting in a lower probability of using similar, straightforward emotional words. As a result, we observed a higher overlap in the emotional words used by audiences for Charlie Chaplin and Secret Superstar, and a lower overlap for Pacific Rim.
To investigate the influence of various similarity measures on experimental results, we conducted experiments using different similarity measures in conjunction with BERT. The employed measures encompassed the Euclidean distance, Pearson correlation coefficient, Manhattan distance, Minkowski distance, and cosine similarity. The experimental results, shown in Table 6, indicate that the employment of cosine similarity demonstrates a higher average F1 score and average overlapped number. Based on these results, we selected cosine similarity as the preferred measure for our method.

Evaluation of Sentiment Intensity Calculation
We randomly selected one movie of each type from Table 2 to show the experimental results of sentiment intensity. The experimental results of sentiment intensity are listed in Table 7 after normalization (for each movie, three highlights are listed). In Table 7, we can find the representative sentiment highlight information. We can see that, for different categories of movies, the distribution of sentiment over highlighted clips is not the same. In addition, these emotional distributions match our impressions of the movies. For instance, Charlie Chaplin is a comedy whose fundamental emotional tone is relaxing, so the value of the good dimension is much higher than that of the other dimensions. Furthermore, the sentiment intensity of Secret Superstar is distributed much more evenly across dimensions instead of concentrating on a single one. This is also in line with our expectations: Secret Superstar is a movie with various sentiments, which means its sentiment is complicated, and audiences may have quite different views on the same sentiment highlight.
To evaluate the performance of the sentiment intensity calculation strategy, we invited three experts in movie appreciation to label the sentiment intensity of each sentiment highlight.
After comparing the sentiment highlights and intensities with their corresponding movie shots and the original TSC data, we found that our sentiment intensity describes the sentiment information of sentiment highlights very well.

Conclusions and Future Work
In this paper, a time-sync-comments-based sentiment analysis model aimed at extracting sentiment highlights from videos and measuring the sentiment intensity of highlights using TSCs is proposed. A four-step approach to extract video highlights and a strategy for calculating sentiment intensity are proposed, enabling the quantitative assessment of sentiment within these video highlights. The experimental results not only show that our approach improves the F1 score by 12.8% and the overlapped number by 8.0% compared with the best existing method in highlight extraction, but also indicate a sentiment distribution in line with the corresponding movie scenes. Moreover, the proposed approach can be widely used for TSCs in various languages. The strategies of sentiment highlight extraction and sentiment intensity calculation proposed in this paper focus on Chinese TSCs, but they can work on other languages by substituting the grammar rules and sentiment analysis methods of those languages.
In the future, prior knowledge will be considered in highlight extraction strategy in order to improve the performances for those movie genres such as action, horror and thriller movies. Then, the sentiment dictionary will be continuously extended to increase the performances of sentiment intensity calculation.