Video Description Model Based on Temporal-Spatial and Channel Multi-Attention Mechanisms

Featured Application: This work can be widely used in advanced intelligent systems, including smart city, smart transportation, and smart home applications.
Abstract: Video description plays an important role in the field of intelligent imaging technology. Attention perception mechanisms are extensively applied in video description models based on deep learning. Most existing models use a temporal-spatial attention mechanism to enhance the accuracy of models. Temporal attention mechanisms can obtain the global features of a video, whereas spatial attention mechanisms obtain local features. Nevertheless, because each channel of the convolutional neural network (CNN) feature maps has certain spatial semantic information, it is insufficient to merely divide the CNN features into regions and then apply a spatial attention mechanism. In this paper, we propose a temporal-spatial and channel attention mechanism that enables the model to take advantage of various video features and ensures the consistency of visual features between sentence descriptions to enhance the effect of the model. Meanwhile, in order to prove the effectiveness of the attention mechanism, this paper proposes a video visualization model based on the video description. Experimental results show that our model achieves good performance on the Microsoft Video Description (MSVD) dataset and a certain improvement on the Microsoft Research-Video to Text (MSR-VTT) dataset.


Introduction
Video description is widely used in advanced intelligent technology, including smart city, smart transportation and smart home [1][2][3][4][5]. Video description technology is a part of computer vision and natural language processing, which has attracted much attention in recent years [6][7][8][9]. In 2014, Venugopalan [10,11] proposed a video description model based on the "encoding-decoding" framework. The encoder in this model extracted features from single video frames by using a CNN, and then adopted mean pooling and time-series encoding models, respectively. Although the model proposed by Venugopalan has been applied successfully in video description, it still has some problems.
The first problem is that video features are not utilized effectively. Video features are only used in the first decoding step and not subsequently, which reduces the ability of video features to predict words as the time series grows [10,11]. Therefore, the capability of sentence generation decreases. The second problem is the consistency of visual content features and sentence descriptions. For the first problem, applying a temporal attention mechanism increases the utilization of video features. However, this approach does not model the relationship between video features and sentence descriptions [12,13]. Therefore, it leads to the second problem, namely how to ensure the consistency of visual content features and sentence descriptions.
For these problems, one solution is to add the video feature each time. However, the video feature consists of multiple images. If we still use the pooling encoding method to send the video feature into the decoding model each time, then the video feature will not be utilized effectively. Xu [12] proposed an image description model based on attention mechanism. The model weighted every region of each image by using the attention mechanism before each word-predicting process, making the feature used in each prediction different. Based on this idea, Yao [13] proposed a video description model based on a temporal attention mechanism. Their model weighted the features of all video frames and summed them whenever making word prediction. Experimental results showed that the video feature was utilized effectively.
In this paper, we propose a video description model based on temporal-spatial and channel attention to solve the abovementioned problems. At present, the most effective way to extract image features is the convolutional neural network (CNN) [14,15]. For an image with a size of w × h (w represents the width and h represents the height), processing it with a CNN yields a new coding feature with a size of w × h × c (c represents the number of feature channels produced by the CNN). The convolution kernels can detect different features of an image. In general, the convolution kernels at lower layers detect information such as edges and texture, and the convolution kernels at higher layers detect features with semantic information. Therefore, the feature map obtained from an image through a CNN contains both spatial and semantic information. However, most existing models only focus on the attention mechanism of time or space [10]. Thus, the substantive features of the CNN are not fully utilized. In contrast, our multi-attention video description model introduces a channel attention mechanism on top of the traditional temporal and spatial attention mechanisms. Besides, this model forms a stronger combination of visual features and sentence descriptions, so that the accuracy of the model is increased. The model is evaluated on the Microsoft Video Description (MSVD) and Microsoft Research-Video to Text (MSR-VTT) datasets. Further, BLEU@4 [16], CIDEr [17], METEOR [18], and ROUGE_L [19] are adopted as evaluation indexes. Experimental results verify the effectiveness of our model. Then a video visualization model is proposed based on video description. In this video visualization model, we make a visual analysis of our attention mechanism and demonstrate the accuracy of the model intuitively.

Attention Mechanism
Video description has achieved a breakthrough through its combination with deep learning [20][21][22][23]. Meanwhile, techniques based on the visual attention mechanism have also been successfully applied in video description models, and they solve the first problem mentioned above.
The visual attention mechanism is widely applied in image and video description tasks because human vision does not process the entire visual input at once. Instead, human vision only focuses on the information of crucial parts. Based on this reasonable hypothesis, current description models usually do not use static coding features of an image or a video. Instead, they use an attention mechanism and sentence context information to extract image features. Therefore, visual attention is an encoding mechanism that extracts features dynamically based on context information over the whole time series. At present, the attention mechanism is mainly based on time and space. We improve our model by utilizing these two factors and taking advantage of the CNN. Then, we introduce the idea of a channel attention mechanism.

Temporal Attention
Yao [13] was the first to propose a video description model based on a temporal attention mechanism. In a video, each frame contains information at a different time, so the visual feature of each frame is a relatively complete expression of semantic information. It expresses global time-series information, such as the order in which objects, actions, scenes, and characters appear in a video. This kind of information may exist throughout a video. A good video description model should focus on important moments in a video sequence. With the help of the temporal attention mechanism, we can decide the importance of different frames.
Appl. Sci. 2020, 10, 4312

Spatial Attention
A video has not only global temporal features but also local temporal features. These local temporal features usually represent the fine-grained features of actions in a video, such as standing up or calling someone. They are also the fine-grained features for identifying important regions of an image, such as the region of the face. These features generally appear only in certain regions during a certain part of the time series of a video. If we apply global pooling to all video frames, then the ability to capture local information will decrease. The spatial attention mechanism enables the model to acquire local features. Xu [12] proposed the idea of applying a spatial attention mechanism in the video description.

Channel Attention
After an image is processed with the CNN, each layer of the CNN produces a different number of feature maps, i.e., channel features. The experiment done by Zeiler [24] showed that the feature maps of different layers of the CNN show different semantic information. Specifically, low-level feature maps show low-level visual features, such as texture and color, whereas high-level feature maps show high-level semantic features such as objects with different spatial features. Because the CNN contains many convolution kernels that can detect different features, we can obtain different feature maps. Figure 1 is the visualization result of a part of the low-level and high-level feature maps after an image enters the CNN. In this figure, the low-level features focus on some edge contour information of the image (Figure 1b), whereas the high-level features tend to express semantic information of the image (Figure 1c). As shown in Figure 1, the brightness of a feature represents its response value to the original image: the brighter the feature, the higher the response value. Figure 1c shows the responses of the feature channels in the same layer after the image is input into the CNN. It can be seen from the figure that the response positions of different feature maps are different and that these different response positions prominently express different semantic information. If the response position of some feature maps is on the person, then the region with people will have a larger regional response value. Other feature maps respond to other objects. Inspired by this visualization result, and in order to obtain video feature information that is consistent with the description words of a video more effectively, this paper proposes a channel attention mechanism. By using the channel attention mechanism, we can enhance the ability of the model to focus on the feature maps that are needed and thus improve the description capability of the model.

Network Architecture
In the previous section, we introduced temporal, spatial, and channel attention mechanisms and their corresponding extraction abilities for different features in a video. To make full use of the capabilities of attention mechanisms, we propose a video description model based on the temporal-spatial and channel attention mechanisms in this section. The fundamental network architecture is shown in Figure 2. This model utilizes different attention mechanisms. For example, when predicting the words in need, this model will combine the features effectively in the video by the attention mechanism automatically, then generate the appropriate corresponding words; thus, it is a solution to the problem related to the consistency of visual content features and sentence descriptions.

As shown in Figure 2, the multi-attention model proposed in this paper consists of three parts: a video encoding network, a multi-attention network, and a decoding network. The video encoding network mainly uses the CNN to extract features of video frames. The multi-attention network comprises two parts: an attention network and a feature fusion. The attention network uses the video features and the previous output to calculate the attention weights of the video features in time, space, and channel, respectively, and then recalculates the fused features from the obtained weights and the video features. In this way, we can get more useful features. Finally, the fused features are exploited by the decoding network to generate output descriptions that are more consistent with the video content.
By extracting visual features effectively in time, space, and channel, the multi-attention description model proposed in this paper enhances the representation ability of the model. The three attention models we propose are relatively independent. We can combine the three models in any order to study the effects of the various combinations. An attention model named S-C-T (spatial-channel-temporal) is shown in Figure 2.
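Because the three attention models are relatively independent, an ordering such as S-C-T can be viewed as function composition. The following is a minimal Python sketch of that idea only. The name `apply_in_order` and the toy operations are hypothetical, and the real attention modules would additionally depend on the decoder output of the previous time step.

```python
def apply_in_order(features, order, ops):
    """Apply attention operations in the order named by a string such as "S-C-T".

    ops maps a letter ("S", "C", "T") to a function that takes features and
    returns re-weighted features. These callables are placeholders (assumed
    for illustration), not the paper's actual attention modules.
    """
    for name in order.split("-"):
        features = ops[name](features)
    return features


# Toy stand-ins that only record the order in which they were applied.
toy_ops = {
    "S": lambda trace: trace + ["spatial"],
    "C": lambda trace: trace + ["channel"],
    "T": lambda trace: trace + ["temporal"],
}

print(apply_in_order([], "S-C-T", toy_ops))
```

Swapping the order string is then enough to switch between combinations without touching the individual modules.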
First, from a video I, we extract the network features by passing it through the CNN:

$$V = \mathrm{CNN}(I) \quad (1)$$

where $V = \{v_1, v_2, \ldots, v_n\}$ indicates that the video has n frames, and the video feature of every frame is $v_i \in \mathbb{R}^{K \times K \times C}$. At time t, the decoding unit, a Long Short-Term Memory (LSTM) network, provides its output $h_{t-1}$ from the previous time step. With known values of $h_{t-1}$ and V, the unknown attention weight factors $(\alpha, \beta, \gamma)$ of space, channel, and time can be calculated by Equation (2):

$$(\alpha, \beta, \gamma) = \Phi(h_{t-1}, V) \quad (2)$$

where Φ covers a series of equations given in Section 3.2, $\alpha \in \mathbb{R}^{K \times K}$, $\beta \in \mathbb{R}^{C}$, and $\gamma \in \mathbb{R}^{N}$. The corresponding formulas are presented in detail in the next section. Then, we apply the three weight factors to V and obtain the feature that is input into the LSTM:

$$z = f(V, \alpha, \beta, \gamma) \quad (3)$$

where f represents the output of the three attention operations on V. Section 3.2 gives a detailed introduction to Equation (3). Then, we can start to predict the next word with the LSTM (Equations (4) and (5)), where $w_{t-1}$ represents the word vector of the corresponding word $y_{t-1}$ and $p_t \in \mathbb{R}^{|D|}$ is the probability distribution of words. In this model, there is a word vector space $W_D \in \mathbb{R}^{D \times S}$, where D is the number of words and S is the dimension of the word vector. Supposing we have M videos and the number of words in a sentence is T, the loss function of the whole model is given by Equation (6), whose mathematical meaning is a maximum likelihood estimation, where Ω represents the parameters that can be trained, including the word vectors. According to the formula, the loss of the model consists of two parts, $L_y$ and $L_\lambda$. $L_y$ is the loss of predicting the consistent word. Because of its high complexity, the model is prone to overfitting. To prevent overfitting and control the complexity of the model, we add the second loss, $L_\lambda$, which regularizes the model parameters; here λ is the regularization coefficient and θ denotes the trained model parameters.
In the training process of the whole model, our primary objective is its optimization. By using the backpropagation algorithm to update the parameters and minimize the loss function, we can achieve an optimum model.
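The two-part loss described above can be illustrated with a small Python sketch. Since the exact form of Equation (6) is not reproduced here, the sketch assumes a negative log-likelihood for the word loss and an L2 penalty for the regularization loss; the function name `caption_loss` and the flat parameter list are hypothetical.

```python
import math


def caption_loss(word_probs, params, lam):
    """Word prediction loss plus a regularization loss (assumed L2 form).

    word_probs: probability the model assigned to each ground-truth word,
                flattened over all videos and time steps.
    params:     flat list of trainable parameter values (hypothetical layout).
    lam:        regularization coefficient (lambda in the text).
    """
    # L_y: negative log-likelihood of the reference words.
    word_loss = -sum(math.log(p) for p in word_probs)
    # L_lambda: assumed squared-norm penalty on the parameters.
    reg_loss = lam * sum(w * w for w in params)
    return word_loss + reg_loss


print(caption_loss([0.5, 0.25], [1.0, 2.0], lam=0.1))
```

Minimizing the first term is equivalent to maximizing the likelihood of the reference sentences, while the second term discourages large parameter values.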

Attention Calculation
From the previous section, we know that this paper uses three kinds of attention: temporal attention, spatial attention, and channel attention. The three are relatively independent, so we can change the order in which they act on the video features. As shown in Figure 2, we apply the spatial attention to the video features first and then apply the channel attention. Each attention is introduced in detail in the following sections.

Spatial Attention
From a video, the CNN can extract the video features $V = \{v_1, v_2, \ldots, v_n\}$, in which each frame is divided into regions, $v_i = \{r_{i1}, r_{i2}, \ldots, r_{iK^2}\}$, where $r_{ij}$ represents the feature of the j-th region of the i-th frame, with a dimension of C. The spatial attention for region ij at time t is formulated by Equations (9)-(11), where W, U, and b are network weight parameters that can be learned and the weights of the regions of frame i are $\alpha_{i1}, \ldots, \alpha_{iK^2}$. $f_{att-s}$ means applying the spatial attention operation to all video frames as in Equations (9)-(11); finally, the video features after spatial attention are obtained: $X^{(t)}_{att-s} = \{v_{1-s}, v_{2-s}, \ldots, v_{n-s}\}$, whose dimension is the same as that of V.
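The re-weighting step of spatial attention can be sketched in plain Python as follows. This is an illustration under assumptions: the relevance scores are passed in directly (in the model they would come from the learned parameters W, U, and b and the previous decoder output), normalization by softmax is assumed, and the names `softmax` and `spatial_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(regions, scores):
    """Re-weight the K*K region features of one frame.

    regions: list of K*K region vectors, each of length C.
    scores:  one relevance score per region (assumed given here).
    Returns the weights and the re-weighted regions, which keep the
    same shape as the input.
    """
    alpha = softmax(scores)
    weighted = [[a * value for value in region]
                for a, region in zip(alpha, regions)]
    return alpha, weighted


# A toy frame with 2 x 2 = 4 regions and C = 2 channels.
regions = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
alpha, weighted = spatial_attention(regions, [0.0, 0.0, 0.0, 0.0])
print(alpha)
```

With equal scores every region receives the same weight; a higher score for one region would shift the weight mass toward it while the output shape stays unchanged.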

Channel Attention
Through the spatial attention, we obtain the new video features $X^{(t)}_{att-s}$, which can be arranged by channel as $U = \{u_1, u_2, \ldots, u_C\}$, where $u_i \in \mathbb{R}^{K \times K \times N}$ and C denotes the number of feature channels. Then we adopt the average pooling operation and obtain the channel feature vector of the video, $c = \{c_1, c_2, \ldots, c_C\}$, where the scalar $c_i$ represents the mean value of $u_i$, which denotes the channel feature value. The channel attention weights β are calculated by Equations (13)-(15), where ⊗ is the symbol of the outer product and ⊕ denotes the addition of matrices and vectors. The dimension of β is C, and we apply β to each $v_{i-s}$. $f_{att-c}$ indicates applying the channel attention operation to all $v_{i-s}$, as in Equations (13)-(15), and we obtain the video features after the channel attention operation, $X^{(t)}_{att-c}$.
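The average pooling over channels and the channel-wise re-weighting can be sketched as follows. This is a simplified illustration: the weights β are supplied by the caller rather than computed from the channel vector and the decoder state, and the function names are hypothetical.

```python
def channel_means(frames):
    """Average each channel over all regions and all frames.

    frames: list of n frames; each frame is a list of regions; each
            region is a list of C channel values.
    Returns a list of C scalars, one mean value per channel.
    """
    n_channels = len(frames[0][0])
    count = len(frames) * len(frames[0])
    return [
        sum(region[ch] for frame in frames for region in frame) / count
        for ch in range(n_channels)
    ]


def apply_channel_weights(frames, beta):
    """Scale every channel value by its weight; the shape is unchanged."""
    return [
        [[b * value for b, value in zip(beta, region)] for region in frame]
        for frame in frames
    ]


# Two frames, two regions per frame, C = 2 channels.
frames = [[[1.0, 10.0], [3.0, 30.0]], [[5.0, 50.0], [7.0, 70.0]]]
print(channel_means(frames))
print(apply_channel_weights(frames, [0.5, 2.0])[0][0])
```

Each channel is summarized by one scalar, so the attention over channels only has to rank C values regardless of the spatial size or the number of frames.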

Temporal Attention
After the feature processing of the spatial attention and channel attention, the video feature is $X^{(t)}_{att-c}$. Then, we adopt a pooling operation on it. Finally, we get $V = \{v_1, v_2, \ldots, v_n\}$, where each $v_i$ has a dimension of C. After obtaining the features of the video sequence, we can weight the features by temporal attention to obtain the feature expression of the whole video. Suppose $\gamma_i$ is the weight of the video feature of frame i when the model predicts words at time t, which satisfies $\sum_{i=1}^{n} \gamma_i = 1$. $\gamma_i$ can be computed from the output of the previous time step $h_{t-1}$ and the video feature V, and the result for frame i is then obtained from this weight, as in Equations (17)-(19). $f_{att-t}$ means the operation of temporal attention for all frames as in Equations (17)-(19), and we obtain the video features after the temporal attention operation, $X^{(t)}_{att-t}$.
After the operations in the spatial, channel, and temporal domains, we get the feature $X^{(t)}_{att-t}$. After weighting $X^{(t)}_{att-t}$, the feature z that needs to be sent to the decoding network is obtained, where pool indicates the spatial pooling operation. Furthermore, the dimensions of $X^{(t)}_{att-s}$ and V are the same.
In the video features, the temporal-spatial features are usually regarded as a holistic feature, so the ST-C and C-ST models represent the temporal-spatial features as a whole for modeling, and the calculation principle is the same as that of the first six models. We will analyze the results of these models in the experiment.
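The temporal weighting described in this section can be sketched in Python as below. The sketch assumes that the frame weights come from a softmax over per-frame scores (which guarantees that they sum to 1) and that the weighted frame features are summed into one vector; the scores are passed in directly, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_features, frame_scores):
    """Weight per-frame feature vectors and sum them over time.

    frame_features: list of n vectors, each of length C.
    frame_scores:   one relevance score per frame (assumed given here;
                    in the model they depend on the previous decoder output).
    """
    gamma = softmax(frame_scores)
    dim = len(frame_features[0])
    fused = [
        sum(g * feature[d] for g, feature in zip(gamma, frame_features))
        for d in range(dim)
    ]
    return gamma, fused


gamma, fused = temporal_attention([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(gamma, fused)
```

Frames with higher scores contribute more to the fused vector, which is how the model can emphasize important moments in the sequence.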

Attention Visualization
This paper attempts to adopt a top-down analysis of the model on the foundation of existing video description models and proposes a significance analysis based on the visual video description. In detail, in an existing video description model, when we get a video and the corresponding description, we want to use the model directly to establish the correspondence between the input object and the words in the sentence, and then output the saliency map of the video. The basic "encoding-decoding" model is used in this paper. As shown in Figure 3, the front part, $h_1$ to $h_m$, is the encoding network, which is the encoding network of the attention model proposed in the paper. Its main principle is to measure the significance level by figuring out the loss of information when using local visual inputs to approximate the entire input sequence.

Suppose that the probability distribution of words $p(y_t)$ is predicted by the LSTM at each moment in the prediction process. We assume that this probability distribution is the "true" distribution. To measure how much information the i-th local description feature of the video brings to the word at time t, we input only the i-th local description feature into the encoder and calculate the probability distribution $q_i(y_t)$ at time t after encoding and decoding. Then, we calculate the KL divergence between the two distributions, $D_{KL}(p(y_t) \,\|\, q_i(y_t))$, as a measure of information loss, where

$$p(y_t) = P(y_t \mid y_{1:t-1}, v_{1:m}) \quad (22)$$

$$q_i(y_t) = P(y_t \mid y_{1:t-1}, v_i) \quad (23)$$

In the input sentence S, because what we input is the true distribution of words, the probability distribution of words satisfies a "one-hot" distribution. The KL divergence between $p(y_t)$ and $q_i(y_t)$ can therefore be simplified as follows:

$$D_{KL}(p(y_t) \,\|\, q_i(y_t)) = \log \frac{1}{q_i(y_t = w)}$$

Therefore, we get the information loss of the predicted word w for the i-th local feature description of the video at time t, and this value represents the connection between the visual features and the words. Hence, we can use it as a saliency value. The range of the value is 0-1; the higher the value, the stronger the saliency.
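The one-hot simplification can be checked numerically with a short Python sketch. It uses the standard definition of the KL divergence; the function names and the toy distributions are hypothetical, and the scaling of the loss into a 0-1 saliency value is not covered here.

```python
import math


def kl_divergence(p, q):
    """Standard KL divergence D(p || q); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def one_hot_loss(q, word_index):
    """Simplified information loss when p is one-hot at word_index."""
    return math.log(1.0 / q[word_index])


p = [0.0, 1.0, 0.0]      # "true" one-hot distribution over a toy vocabulary
q_i = [0.2, 0.5, 0.3]    # distribution predicted from one local feature only
print(kl_divergence(p, q_i), one_hot_loss(q_i, 1))
```

Both functions return the same value for a one-hot `p`, so only the probability that the local feature assigns to the actual word is needed.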

Datasets and Evaluation Metrics
The model is evaluated on the Microsoft Video Description (MSVD) [25] and Microsoft Research-Video to Text (MSR-VTT) [26] datasets. The MSVD dataset contains 1970 short videos: 1200 videos for the training set, 100 videos for the validation set, and 670 videos for the testing set. The MSR-VTT dataset contains 6513 videos for the training set, 497 videos for the validation set, and 2990 videos for the testing set. BLEU@4 [16], CIDEr [17], METEOR [18], and ROUGE_L [19] are adopted as evaluation indexes.

Experiment Setting
In our experiment, we sampled each video uniformly to obtain a group of 26 frames. If a video had fewer than 26 frames, we filled it with zero padding. For the corpus in the dataset, we used NLTK (Natural Language Toolkit) [27] to tokenize and process the sentences. When the frequency of a word is too low, it is difficult for the model to learn it. At the same time, according to our statistics, the proportion of words with a frequency of less than 4 is relatively small in the entire dataset. For example, this proportion is only 1.69% in the MSR-VTT dataset. There are also some misspelled words or words with no specific meaning in the dataset, which are all incorrect words. The presence of these incorrect words makes it difficult for the model to describe the content of the video. Thus, words with a frequency of less than four and incorrect words were eliminated. This is also a general processing method in video description [28]. Meanwhile, three special words (<start>, <end>, and <unknown>) were added to represent the beginning of a sentence, the end of a sentence, and the unknown identifier, respectively. Finally, the word dictionary was obtained. For each description sentence, we fixed the length of the sentence to 20. If the length of the sentence was less than 20, we filled it with ending characters.
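The preprocessing steps above can be sketched with the Python standard library. This is an illustration under assumptions rather than the original pipeline: sentences are assumed to be already tokenized (NLTK is not called here), rare words are mapped to <unknown> instead of being removed, sentences longer than 20 words are truncated, and all function names are hypothetical.

```python
from collections import Counter

MIN_FREQ = 4     # words seen fewer than 4 times are dropped from the dictionary
MAX_LEN = 20     # fixed sentence length
NUM_FRAMES = 26  # frames kept per video


def sample_frames(frames, n=NUM_FRAMES, zero=0.0):
    """Uniformly sample n frames; zero-pad when the video is shorter."""
    if len(frames) >= n:
        return [frames[i * len(frames) // n] for i in range(n)]
    return list(frames) + [zero] * (n - len(frames))


def build_vocab(tokenized_sentences, min_freq=MIN_FREQ):
    """Keep words with frequency >= min_freq, after the three special words."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    kept = sorted(word for word, c in counts.items() if c >= min_freq)
    return ["<start>", "<end>", "<unknown>"] + kept


def encode(sentence, vocab, max_len=MAX_LEN):
    """Map words to indices, then pad with <end> up to max_len."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(tok, index["<unknown>"]) for tok in sentence][:max_len]
    return ids + [index["<end>"]] * (max_len - len(ids))


corpus = [["a", "cat"], ["a", "dog"], ["a", "cat"], ["a", "cat"], ["cat"]]
vocab = build_vocab(corpus)
print(vocab)
print(encode(["a", "dog", "cat"], vocab, max_len=5))
```

In the toy corpus "dog" occurs only once, so it falls below the frequency threshold and is encoded with the index of <unknown>.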
Inception-V3 [29] served as the CNN for extracting video features in the model. We used the output of "pool3" in Inception-V3 as the video features, with a dimension of 8 × 8 × 2048. We only used Inception-V3, whose input size was 299 × 299, to extract video features, and we did not train its parameters. The word representation used in our model is word embedding. In this model, the dimension of the hidden layer of the LSTM was 1024, the word vector dimension was 512, and the number of LSTM time steps was 20. The Adam [30] algorithm was utilized to optimize the model training process with an initial learning rate of 0.001.
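For reference, the settings stated above can be collected in a single configuration object. The following Python sketch is only a convenience illustration; the key names are hypothetical and do not come from the original implementation.

```python
# Hyperparameters as stated in the text; the key names are assumed.
CONFIG = {
    "feature_shape": (8, 8, 2048),  # "pool3" output of Inception-V3
    "lstm_hidden_dim": 1024,
    "word_embedding_dim": 512,
    "lstm_time_steps": 20,
    "optimizer": "adam",
    "initial_learning_rate": 0.001,
}

# Number of spatial regions per frame and channels per region.
num_regions = CONFIG["feature_shape"][0] * CONFIG["feature_shape"][1]
num_channels = CONFIG["feature_shape"][2]
print(num_regions, num_channels)
```

Under these settings, spatial attention weights 64 regions per frame and channel attention weights 2048 channels.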

Analysis of Different Attention Combinations
The multi-attention video description model we proposed in this paper is shown in Figure 3 and contains temporal, spatial, and channel attention mechanisms represented by T (temporal), S (spatial), and C (channel), respectively. We can combine the three attention mechanisms to build six basic models: T-S-C, T-C-S, S-T-C, S-C-T, C-T-S, and C-S-T. Besides, because the information in the time domain and the space domain is often correlated in a video, we also consider space and time as a whole and build two more models: ST-C and C-ST. Therefore, our model has eight different combinations in total. Venugopalan et al. [10] proposed the Base Model. The experimental results on MSVD are shown in Table 1. Each experimental result is the statistical value of multiple experiments. All the test results were obtained by a simple greedy search method. The results show that the S-C-T model has the best results on the CIDEr, METEOR, and ROUGE_L evaluation indexes, and the ST-C model has the best result on the BLEU-4 evaluation index. Therefore, the S-C-T model was the best of the first six models, while the C-S-T model performed worst. This difference indicates that the video description model relies more on certain spatial region features of the video frames and that obtaining effective spatial features improves the effectiveness of the model. Similarly, when we considered space and time as a whole, ST-C performed well on certain performance indexes. According to the results, if the model focused on the channel attention first, the results were relatively poor among all the models. Because the channel attention focuses on certain semantics of the local features in the video, this indicates that focusing on the global features first and on the local features afterward helps improve the performance of the model. Table 2 shows the results of the S-C-T model with Greedy Search and Beam Search.
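The eight combinations listed above can be enumerated programmatically, which is convenient when running the same experiment for every ordering. The following Python sketch uses `itertools.permutations`; the function name `attention_orders` is hypothetical.

```python
from itertools import permutations


def attention_orders():
    """Return the six basic orderings plus the two joint space-time models."""
    basic = ["-".join(order) for order in permutations("TSC")]
    return basic + ["ST-C", "C-ST"]


print(attention_orders())
```

The six permutations of T, S, and C come out in the same order as they are listed in the text, followed by ST-C and C-ST.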
For the Beam Search algorithm, the test results for K from 2 to 10 are shown in this table. Each experimental result is the statistical value of multiple experiments. It can be seen that the Beam Search algorithm brings a significant improvement on BLEU-4, CIDEr, and METEOR, and a small improvement on ROUGE_L. The Beam Search algorithm gives better results because, compared with the Greedy Search algorithm, it keeps more search paths and explores more possible sentence sequences. However, as K increases, the calculation time of the Beam Search algorithm increases, while the result tends to become stable. Therefore, it is necessary to select an appropriate K to achieve a balance between time and performance, and K is usually selected to be 6 or less. As shown in Table 2, when K = 6, the S-C-T model obtains the best results in this paper.
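The difference between Greedy Search and Beam Search can be seen on a toy example. The Python sketch below is a generic beam search over a hypothetical next-word table (`TOY`, `toy_step`), not the decoder used in the experiments; with K = 1 it reduces to greedy search.

```python
import math


def beam_search(step_fn, start, end, k, max_len):
    """Keep the k most probable partial sentences at every step.

    step_fn(prefix) returns a dict mapping each candidate next word to its
    probability. Scores are sums of log-probabilities.
    """
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for word, prob in step_fn(seq).items():
                if prob > 0:
                    candidates.append((score + math.log(prob), seq + [word]))
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = []
        for score, seq in candidates[:k]:
            if seq[-1] == end:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished sentences if max_len was reached
    return max(finished, key=lambda item: item[0])[1]


# Hypothetical next-word probabilities; unknown prefixes end the sentence.
TOY = {
    ("<start>",): {"a": 0.6, "the": 0.4},
    ("<start>", "a"): {"cat": 0.55, "dog": 0.45},
    ("<start>", "the"): {"cat": 0.9, "dog": 0.1},
}


def toy_step(prefix):
    return TOY.get(tuple(prefix), {"<end>": 1.0})


print(beam_search(toy_step, "<start>", "<end>", k=1, max_len=5))
print(beam_search(toy_step, "<start>", "<end>", k=2, max_len=5))
```

In this toy table, greedy search commits to "a" (probability 0.6) and ends with a sentence of probability 0.33, whereas K = 2 also keeps "the" and finds "the cat" with probability 0.36. Larger K values examine more candidates per step, which is why the calculation time grows with K.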

Comparison with Methods in Other Papers
To verify the effectiveness of the model, we compared our method with other methods in existing relevant articles, including the basic Seq-Seq model (S2VT) [11], the joint embedded model (LSTM-E) [31], the temporal attention model (TA) [13], the hierarchical model (HRNE-SA) [32], the paragraph description model (P-RNN) [33], the task-driven dynamic model (TDDF) [34], the multi-level attention model based RNN (MAM-RNN) [35], the LSTM-GAN model [7], the video captioning model with tube features (VCTF) [36], the VD-SVOs model [37], and the spatial-temporal attention mechanism (STAT) [38]. Among them, the joint embedded model, HRNE-SA, and P-RNN used average pooling, spatial attention, and temporal-spatial attention, respectively. To compare with the other models fairly, our model utilized the Beam Search algorithm with K = 6. Table 3 shows that our S-C-T model was the best on CIDEr and METEOR, and consistent with the best method, TDDF, on ROUGE-L. Compared with STAT, although our S-C-T model is weaker on BLEU-4, it is much stronger on the CIDEr index and slightly better on METEOR. Moreover, STAT uses image features, motion features, and local features, while we only use image features, which makes our S-C-T model easier to implement. Furthermore, when STAT only employs image features (STAT_GlobFeat), our model is superior to STAT on CIDEr and METEOR and almost consistent with it on BLEU-4. One prominent result of our experiment was the significant improvement of CIDEr compared with the other methods. Unlike the other indexes, CIDEr is an index designed specifically for image and video description, and it jointly considers the punctuation of words, the accuracy of word order, the accuracy of semantic and content descriptions, and fluency. Therefore, this experiment verified the effectiveness of our multi-attention model in taking advantage of the visual features and solving the problem of consistency between visual features and the semantic description.
To analyze the model on a larger dataset, we evaluated the model on the MSR-VTT [26] dataset. The result is shown in Table 4. Among the models in Table 3, only TDDF, LSTM-GAN, and STAT were tested on MSR-VTT, so our paper only compares with these models. As shown in Table 4, our model has an improvement on BLEU-4 and METEOR, and a small improvement on ROUGE-L. However, the results of CIDEr show significant differences between the MSVD and MSR-VTT datasets. On the MSVD dataset, the CIDEr score of our model is significantly higher than those of TDDF and STAT, but it is lower than theirs on the MSR-VTT dataset. This result can be explained by the datasets. The MSVD dataset has fewer videos than the MSR-VTT dataset, but each video has 40 description statements, whereas each video in the MSR-VTT dataset has only 20 description statements. Our model mainly obtains the effective features of the video from the temporal, spatial, and channel attention, and a single video with more descriptions helps the feature acquisition ability of the attention mechanism. Therefore, our model performs better when the descriptions are more diverse. At the same time, our model only uses image features as its base features, whereas TDDF uses both image features and motion features and STAT uses image features, motion features, and local features, which is beneficial for datasets with more action categories like MSR-VTT.
Some video description results are shown in Figure 4. For the first two scenarios, the model gives a very appropriate statement that describes the content of the video objectively. A bad example is also shown in this figure. The generated description is "a man is singing a a a", which contains an obvious error. The model cannot give the correct word after the article "a," which may be a word such as "song." This may be because the attention mechanism lacks the ability to model abstract nouns such as "song," which have no specific visual expression.

Visual Analysis and Validation of Attention Mechanism
To understand the correlation between visual features and semantic features intuitively, we explored the relationship between the video description and the video content and attempted to analyze their inner links visually. Finally, we proposed a visualization model based on the video description.
We selected a video from MSVD and used the method introduced in the previous sections to obtain the salient region for each word of the generated sentence. Figure 5 is divided into three parts: the beginning of the video, the body of the video, and the end of the video. Each part displays four images. The descriptive sentence was "a cat chases a dog." Every image contains the same five video frames, with a descriptive word in the top left corner, and represents the salience response to that word over time. Furthermore, the luminosity of a salient region increases with the intensity of the salience response in that region.
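The relationship between luminosity and salience response described above can be sketched as follows. This is an illustrative assumption about how such an overlay might be rendered, not the paper's visualization code: the function `salience_overlay`, the `floor` parameter (the brightness kept in non-salient regions), and the nearest-neighbour upsampling of a coarse attention grid are all hypothetical choices.

```python
from typing import List

Grid = List[List[float]]


def salience_overlay(frame: Grid, attention: Grid, floor: float = 0.2) -> Grid:
    """Scale the brightness of a grayscale frame by a coarse attention grid.

    frame:     H x W pixel intensities in [0, 1].
    attention: h x w non-negative attention weights for one word, where
               H is a multiple of h and W is a multiple of w.
    floor:     brightness factor kept where the response is weakest.
    """
    rows, cols = len(frame), len(frame[0])
    a_rows, a_cols = len(attention), len(attention[0])
    if rows % a_rows or cols % a_cols:
        raise ValueError("frame size must be a multiple of the attention grid size")

    flat = [value for row in attention for value in row]
    lowest, highest = min(flat), max(flat)
    span = highest - lowest
    cell_h, cell_w = rows // a_rows, cols // a_cols

    out: Grid = []
    for y in range(rows):
        out_row = []
        for x in range(cols):
            # Nearest-neighbour upsampling: each pixel takes the weight of its cell.
            weight = attention[y // cell_h][x // cell_w]
            # Min-max normalise; a uniform map yields no salient region at all.
            norm = (weight - lowest) / span if span > 0 else 0.0
            out_row.append(frame[y][x] * (floor + (1.0 - floor) * norm))
        out.append(out_row)
    return out


# A 4 x 4 white frame with a 2 x 2 attention grid peaking in the top-right cell.
frame = [[1.0] * 4 for _ in range(4)]
lit = salience_overlay(frame, [[0.1, 0.7], [0.1, 0.1]])
uniform = salience_overlay(frame, [[0.25, 0.25], [0.25, 0.25]])
```

In this sketch the cell with the strongest response keeps full brightness and the rest is dimmed to `floor`, while a uniform attention map dims every region equally.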
In Figure 5, our model responded effectively to nouns such as "cat" and "dog", which indicates that it can differentiate nouns well. The strongest salience responses to the word "cat" were concentrated on the cat. Likewise, the salience response to the word "dog" remained accurate as the video played. Articles such as "a" are not very relevant to the visual content, so all regions in the video are treated equally and there are no salient regions. Similarly, for verbs such as "chases", there was no region of salience response initially. However, as the cat and the dog moved in the video, some regions of the video frame began to receive responses. Therefore, our model can capture verbs with temporal-spatial continuity to a degree.
We also selected representative images from the dataset. Figure 6 displays the salience response of the model to humans, animals, and objects, respectively. It can be seen that our model produced excellent salience responses to the different categories and can distinguish between them. For example, when multiple categories appear in one image, the model generates the corresponding response for each properly.
Meanwhile, Figure 7 contains examples of verbs. For example, the verb "play" appears in Figure 7, Row 1. The response to "play" is mainly concentrated around the guitar, indicating that the model has a certain ability to respond to verbs.
The salience visualization analysis of the video description illustrates the correlation between visual features and the sentence description. Furthermore, it validates the consistency of visual features and sentence descriptions, and provides a corresponding effective visualization model.

Conclusions
In this paper, we proposed a video description model based on temporal-spatial and channel attention. Specifically, we fully utilized the essential characteristics of the CNN and added channel features into the attention mechanism of the model. As a result, the model can use visual features more effectively and ensure the consistency of visual features and sentence descriptions, which enhances its performance. Our experimental results show that the model achieved good performance on the MSVD dataset. To analyze the model on a larger dataset, we also evaluated it on the MSR-VTT dataset, where the results show a certain improvement. Finally, we proposed a video visualization model based on the video description and visually demonstrated its effectiveness.
