Video Captioning Based on Channel Soft Attention and Semantic Reconstructor

Abstract: Video captioning is a popular task that automatically generates a natural-language sentence to describe video content. Previous video captioning works mainly use the encoder–decoder framework and exploit special techniques such as attention mechanisms to improve the quality of generated sentences. Most attention mechanisms focus on global features and spatial features; however, global features are usually fully connected features. Recurrent convolution networks (RCNs) receive 3-dimensional features as input at each time step, but they ignore the temporal structure of each channel across time steps, which provides temporal relation information for each channel. In this paper, a video captioning model based on channel soft attention and a semantic reconstructor is proposed, which considers the global information for each channel. In a video feature map sequence, the same channel of every time step is generated by the same convolutional kernel. We selectively collect the features generated by each convolutional kernel and then input the weighted sum of each channel to the RCN at each time step to encode the video representation. Furthermore, a semantic reconstructor is proposed to rebuild semantic vectors to ensure the integrity of semantic information in the training process, which takes advantage of both forward (semantic to sentence) and backward (sentence to semantic) flows. Experimental results on the popular datasets MSVD and MSR-VTT demonstrate the effectiveness and feasibility of our model.


Introduction
Video captioning is a popular and challenging task. It involves both computer vision and natural-language processing. Automatic video caption generation has many practical applications. For example, it could help improve the quality of online video indexing and searching. As another example, in combination with speech synthesis technology, describing videos with natural language could help the visually impaired to understand video contents.
The most important part of video captioning is to extract a precise video representation. Because a video is made up of continuous images, it is a spatial-temporal extension of the pixel, so obtaining a video representation with spatial-temporal information is a crucial part of video captioning. In early methods such as Dense Trajectories (DT) [1] and Improved Dense Trajectories (iDT) [2], the spatial-temporal descriptors are first handcrafted and then encoded through the Vector of Locally Aggregated Descriptors (VLAD) [3] or the Fisher Vector [4] to form the final video representation. Deep learned descriptors have shown strong competitiveness in video tasks since image classification made great improvements [5]. Xu et al. [6] propose a framework, Sequential VLAD (SeqVLAD), in which the descriptor encoding is learnable and the temporal information can be aggregated in detail by using an improved recurrent convolutional network (RCN). Although their framework has achieved great success in video feature aggregation, it ignores the temporal structure of the feature map of each channel, which provides global nuances of each channel.

• A deep learning framework for video captioning is proposed, which optimizes the Sequential VLAD encoding by channel soft attention and leverages the semantic information by a semantic reconstructor.
• An effective attention module is implemented, which exploits the temporal structure for each channel, where the c-th channel of the input feature maps at the t-th time step is the weighted sum of the c-th channel of all feature maps.
• A semantic reconstructor network composed of LSTM is proposed to assist the decoder in fully using the semantic information.
• Experimental results on the Microsoft Video Description (MSVD) dataset [12] and the MSR-Video To Text (MSR-VTT) dataset [13] demonstrate the feasibility and effectiveness of the proposed method.

Related Work
The early visual captioning methods normally detect the Subject, Verb, and Object separately using a rule-based system [14,15], then generate a sentence by joining them together. However, Vinyals et al. [16] made a breakthrough in image captioning by using deep learning methods. They generate natural language directly from image features by using a recurrent neural network (RNN). Subsequently, video captioning based on deep learning became a topic of significant research interest. Compared with images, videos are more complex and diverse, which imposes great challenges for understanding video content. Donahue et al. [17] propose the long-term recurrent convolutional network (LRCN), which is the first video captioning model based on deep learning. It extracts the features of each frame through a model pretrained for image classification and then, after average pooling, inputs the features into an LSTM network for decoding to obtain the sentences. The model has been widely adopted as a benchmark model for video captioning. Since then, many innovations and improvements on the model have been reported.
Future Internet 2021, 13, 55
Sequence Learning Approaches: Venugopalan et al. [18] first propose an end-to-end model for generating video captions. Their model takes one frame out of every 10 frames of video, extracts the output features of the fully connected layer through a convolutional neural network pretrained on ImageNet [19], and averages the output over all the frames of the video to get a 4096-dimensional vector, which is then used as the input of the first LSTM layer at every time step. The output of the LSTM decoder gives the probability distribution of each word in the vocabulary through SoftMax. Because this method applies average pooling to the image features, it loses the temporal information in a video.
Venugopalan et al. then propose S2VT [20], which uses LSTMs as both encoder and decoder and incorporates optical flow to obtain the temporal information. To better capture the temporal dynamics of the video sequence, Yao et al. [21] introduce an attention mechanism to assign weights to the features of each frame and then fuse them based on the attention weights. As for semantics, the relevance between video context and sentence semantics is considered to be a regularizer in the LSTM [22]. Pan et al. [23] employ high-level semantic attributes as a supplementary representation for caption generation by using a transfer unit. Gan et al. [11] train a semantic detection network by using a multi-label classification method to generate semantic tags and embed the semantic information with a semantic compositional network to generate sentences. Wang et al. [24] propose a reconstructor for video captioning, a dual learning mechanism that reconstructs the source from the target by a backward flow. To obtain finer motion information among successive video frames, Xu et al. [6] conduct feature clustering with a parameter-sharing recurrent convolutional network and then adopt an end-to-end, sequence-to-sequence framework for encoding and decoding, which shows good performance. Due to the temporal characteristics of video and language, sequence learning is always an efficient method. In this paper, the encoder-decoder module based on sequential video VLAD and the semantic reconstructor can both be regarded as sequence learning methods.
Attention Mechanism: Selectivity is an important characteristic of the neural mechanism of the human brain. It represents the ability to filter out unimportant information and can be called an attention mechanism. Recently, many attention methods have been proposed to improve algorithm performance in various tasks. Luong et al. [8] propose global attention and local attention for neural machine translation. Global attention is also called soft attention in other research. Yao et al. [21] apply the soft attention mechanism to the video captioning task, where it has shown good performance. Stollenga et al. [25] introduce a deep attention selective network to solve image classification tasks. Chorowski et al. [26] explore attention-based recurrent networks for speech recognition. Pan et al. [27] make a great improvement in the image captioning task with X-Linear Attention. A recent work by Hu et al. [28] designs the squeeze-and-excitation module (SEnet) to exploit the interchannel relationship. Based on SEnet, Woo et al. [7] exploit both spatial and channel-wise attention and introduce the convolutional block attention module (CBAM), which can be used as a plug-and-play module for pre-existing CNN-based architectures. These two modules [7,28] have shown the great importance of channel attention. However, although videos have a temporal structure, the global structure of each channel has not been taken seriously. We combine the advantages of channel attention and soft attention to exploit the global structure for each channel.
Based on the previous works above, the model proposed in this paper considers the attention mechanism comprehensively from both the temporal and the channel perspective. Furthermore, the semantic concepts are embedded in the decoder and the semantic vector is reconstructed to fully use the semantic information. This can be seen as a novel loss for generating accurate captions.

Method
In this paper, a channel soft attention improved sequential video VLAD (SeqVLAD) [6] with a semantic reconstruction network (CSA-SR) is proposed for video captioning. The overall structure is shown in Figure 1. The proposed CSA-SR mainly contains three parts: encoder, decoder, and semantic reconstructor. The encoder extracts video features and encodes them for the decoder. Frame-level features are extracted from the video by using 2D convolutional neural networks, and we also employ 3D convolutional neural networks to extract video features from video clips. The fine motion information among successive video frames is encoded by our proposed channel soft attention and SeqVLAD. In addition, the features extracted from the 2D and 3D convolutional neural networks are both used in the semantic detection network (SDN). In the decoder, the semantic LSTM combines the video representation and semantic vectors to generate sentences. The reconstructor rebuilds the semantic vectors. By fully using the bidirectional flows between semantic vector and sentence, the performance of video captioning can be further boosted. In the following part, we first briefly introduce VLAD [3] and SeqVLAD, and then present the details of the channel soft attention and semantic reconstruction network of our model.

VLAD and SeqVLAD
VLAD can be seen as a simplified Fisher Vector. The Fisher Vector encoding uses a Gaussian Mixture Model (GMM) to construct visual descriptors, whereas VLAD does not store second-order information about the features and typically uses K-means instead of a GMM to generate descriptors. VLAD calculates the differences between each local descriptor and its nearest cluster center, then aggregates the local descriptors into a fixed-dimension vector. Suppose we employ K-means to generate K centers $C = \{c_1, c_2, \ldots, c_K\}$, and the local descriptors with dimension D are $X = \{x_1, x_2, \ldots, x_N\}$. The VLAD encoding can be expressed by the following formulation:
$$\nu_k = \sum_{n=1}^{N} a_k(x_n)\,(x_n - c_k),$$
where N is the number of descriptors, $x_n$ is the n-th descriptor, $c_k$ is the k-th cluster center, and the assignment $a_k(x_n)$ represents the relationship between the current descriptor and the k-th cluster center. If $c_k$ is the nearest cluster center to the current descriptor, $a_k(x_n) = 1$; otherwise, $a_k(x_n) = 0$. VLAD calculates $\nu_k$ over all cluster centers as $V = \{\nu_1, \nu_2, \ldots, \nu_K\}$. With the dimension D of each center, we can compute the final representation with dimension $K \times D$.
VLAD performs well in image retrieval. However, because the assignment $a_k(x_n)$ is discontinuous and non-differentiable, training VLAD inside a network is a difficult problem. The purpose of SeqVLAD is to plug VLAD into a deep neural network to aggregate the video representation locally and temporally. SeqVLAD tries to leverage the benefits of both convolutional units and the VLAD encoding method in an RNN architecture, using parameter-shared gated recurrent convolutional units (SGRU-RCN) as the assignment of the VLAD encoding. Recurrent convolutional networks (RCNs) successfully combine CNNs and RNNs, and can therefore obtain both spatial and temporal video features for video processing tasks.
It has been shown that an RNN focuses more on global feature changes and discards the fine motion information between successive video frames. To capture more detailed information, the gated recurrent convolutional unit (GRU-RCN) replaces the fully connected operations with convolution operations; it takes a 3-dimensional feature map as input and outputs a learned feature map with the same number of dimensions. To reduce the number of parameters and the chance of overfitting, SGRU-RCN shares the input-to-hidden convolutional kernels. An SGRU-RCN unit for VLAD encoding is defined as follows:
$$z_t = \sigma(W_q * x_t + U_z * a_{t-1}),$$
$$r_t = \sigma(W_q * x_t + U_r * a_{t-1}),$$
$$\tilde{a}_t = \tanh(W_q * x_t + U * (r_t \odot a_{t-1})),$$
$$a_t = (1 - z_t) \odot a_{t-1} + z_t \odot \tilde{a}_t,$$
where $*$ denotes a convolution operation, $\sigma$ is a sigmoid function, and $\odot$ is an element-wise multiplication. The input $x_t$ at time step t is a 3-dimensional feature map with shape $C \times H \times W$, where C, H and W denote the number of channels, height, and width of the feature maps, respectively. The shape of $W_q$ is $K \times C \times 1 \times 1$. $W_q$ is the shared convolutional kernel for $z_t$, $r_t$, and $a_t$, where $z_t$ is an update gate indicating whether previous information needs to be updated, and $r_t$ is a reset gate indicating whether previous information needs to be reset. The parameters $U_z$, $U_r$, and $U$ are 2-dimensional convolutional kernels. The output of SGRU-RCN, $a_t$, with shape $K \times H \times W$, can be taken as the weights of the spatial and temporal assignment for VLAD encoding. $a_t$ can be viewed as $a_t^k(i, j)$, denoting the assignment weight for aggregating the descriptor at location $(i, j)$ of the t-th frame to the k-th cluster center. Because the assignment values should range from 0 to 1, the output $a_t^k(i, j)$ is wrapped by a softmax function, which can be described as:
$$a_t^k(i, j) = \frac{\exp\big(a_t^k(i, j)\big)}{\sum_{k'=1}^{K} \exp\big(a_t^{k'}(i, j)\big)},$$
where K denotes the total number of cluster centers. The SeqVLAD encoding is then:
$$\nu_k = \sum_{t=1}^{T} \sum_{i=1}^{H} \sum_{j=1}^{W} a_t^k(i, j)\,\big(x_t(i, j) - c_k\big), \tag{8}$$
where $x_t(i, j)$ represents the descriptor at location $(i, j)$ and T represents the input sequence length. As Equation (8) shows, the encoding accumulates the differences between descriptors and cluster centers locally and temporally.
The output of SeqVLAD concatenates $\nu_k$ over the K cluster centers, and the final representation has dimension $K \times D$, the same as VLAD.
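The two aggregation schemes described in this section can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names `vlad_encode` and `soft_vlad_aggregate` are hypothetical, the cluster centers are assumed to be given, and the raw assignment scores `a` stand in for the SGRU-RCN output, which is not modeled here.

```python
import numpy as np

def vlad_encode(X, C):
    """Hard-assignment VLAD. X: (N, D) descriptors, C: (K, D) centers."""
    # Squared distance between every descriptor and every center: (N, K)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)               # a_k(x_n) = 1 only for this k
    K, D = C.shape
    V = np.zeros((K, D))
    for k in range(K):
        members = X[nearest == k]
        V[k] = (members - C[k]).sum(axis=0)   # sum of residuals to center k
    return V.reshape(-1)                      # final K * D representation

def soft_vlad_aggregate(x, a, C):
    """Soft-assignment aggregation over time and space.

    x: (T, D, H, W) feature maps, a: (T, K, H, W) raw assignment scores
    (assumed to come from a recurrent unit), C: (K, D) centers.
    Returns a (K, D) array of accumulated residuals.
    """
    w = np.exp(a - a.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)            # softmax over the K centers
    weighted = np.einsum("tkhw,tdhw->kd", w, x)     # sum of a * x
    mass = w.sum(axis=(0, 2, 3))                    # sum of a, shape (K,)
    return weighted - mass[:, None] * C             # sum of a * (x - c_k)
```

The soft variant uses the identity that the weighted sum of residuals equals the weighted sum of descriptors minus the total assignment mass times the center, which avoids materializing a (T, K, D, H, W) tensor.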

Channel Soft Attention
The input of SeqVLAD at each time step is a 3-dimensional feature map, which is the output of effective models pretrained on ImageNet [19]. If SeqVLAD receives only the feature maps of one frame at each time step, it loses the ability to exploit the global temporal structure of the video. The features and information in the later frames of a video are associated with the previous frames. Inspired by the successful application of soft attention in natural-language processing, we propose a channel soft attention to exploit the temporal structure of 3-dimensional image feature maps. As shown in Figure 2, in a sequence of video features, the shape of the feature maps at each time step is $C \times H \times W$. Each block in Figure 2 with a different color denotes one feature map with shape $H \times W$. The feature map of the i-th channel at each time step is generated by the same convolution kernel. Our approach is to exploit the temporal structure for each channel. Instead of taking the feature maps of one frame at each time step, we take the dynamic weighted sum of the temporal feature maps for each channel. The input of SeqVLAD at each time step is defined as:
$$\phi_t(V_c) = \sum_{i=1}^{N} A^t_{c,i}\, v_{c,i},$$
where $\sum_{i=1}^{N} A^t_{c,i} = 1$. $A^t_{c,i}$ is computed at each time step t inside the SGRU-RCN encoder and denotes the attention weight of channel c of the i-th frame at time step t. The value of the attention weight reflects the relevance of the i-th temporal feature of each channel in the input video given all the previously generated assignments. $v_{c,i}$ denotes the feature map of channel c of the i-th frame. Concatenating all the $\phi_t(V_c)$ gives feature maps with shape $C \times H \times W$, which is the same as the original shape of the feature maps of one frame.
We obtain the attention weight $A^t_{c,i}$ by normalizing the relevance scores $e^t_{c,i}$:
$$A^t_{c,i} = \frac{\exp\big(e^t_{c,i}\big)}{\sum_{j=1}^{N} \exp\big(e^t_{c,j}\big)}.$$
[Figure 2: feature maps of the video frames and the weighted feature maps for each time step.]
We design a function that takes the previous assignment $a_{t-1}$ of the SGRU-RCN as input and summarizes all the previously generated assignments and the feature maps of the i-th temporal feature. The function gives the unnormalized relevance score $e^t_i$:
$$e^t_i = CA\big(W_a * a_{t-1} + U_a * v_i + b_a\big),$$
where $W_a$ and $U_a$ are convolutional kernels and $b_a$ are parameters, all of which are trained together with the encoder-decoder network, and $v$ is the input feature maps of the entire video sequence. The relevance score $e^t_i$ with shape $N \times C$ involves the score of each channel at each time step, and $CA(\cdot)$ is a channel attention module. To compute the 2-dimensional relevance score with shape $N \times C \times 1 \times 1$, we use the channel attention module in CBAM [7] with a small change:
$$CA(F) = \tanh\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big).$$
Sigmoid ranges from 0 to 1, whereas tanh ranges from −1 to 1. To expand the differentiation of each weight within a controllable range, the sigmoid activation function is replaced by tanh. $e^t_i$ with C channels is the relevance score of the i-th frame at the t-th time step. It can be viewed as $e^t_{c,i}$ to represent the score for each channel.
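The per-channel weighted sum over frames can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the relevance scores are passed in directly rather than computed from the recurrent state, and the names `channel_soft_attention`, `channel_attention_tanh`, `W1`, and `W2` are hypothetical. The second function follows the CBAM channel attention layout (shared two-layer MLP over average-pooled and max-pooled descriptors) with tanh as the output activation.

```python
import numpy as np

def channel_soft_attention(v, e):
    """Weighted sum of every channel over all frames.

    v: (N, C, H, W) feature maps of the N frames.
    e: (N, C) unnormalized relevance scores for the current time step.
    Returns the (C, H, W) input for this time step and the weights A (N, C).
    """
    e = e - e.max(axis=0, keepdims=True)                   # numerical stability
    A = np.exp(e) / np.exp(e).sum(axis=0, keepdims=True)   # softmax over frames
    return np.einsum("nc,nchw->chw", A, v), A

def channel_attention_tanh(F, W1, W2):
    """CBAM-style channel attention with tanh instead of sigmoid.

    F: (N, C, H, W); W1: (C, R) and W2: (R, C) are the shared MLP weights
    (hypothetical parameters, biases omitted). Returns scores of shape (N, C).
    """
    avg = F.mean(axis=(2, 3))                          # (N, C) average pooling
    mx = F.max(axis=(2, 3))                            # (N, C) max pooling
    mlp = lambda z: np.maximum(z @ W1, 0.0) @ W2       # shared MLP with ReLU
    return np.tanh(mlp(avg) + mlp(mx))                 # values in [-1, 1]
```

With all scores equal, the attention reduces to a plain temporal average of each channel; a dominant score for one frame makes that frame's channel pass through almost unchanged.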

Semantic Reconstruction Network
A convolutional GRU can capture rich temporal and spatial information of a video. However, to generate semantically accurate sentences, semantic information is needed to enrich the features obtained from the SeqVLAD encoder. In our approach, the semantic detection network (SDN) is used to generate the semantic vectors, and the semantic compositional network (SCN, a variant of LSTM) is taken as the decoder. Moreover, a dual learning mechanism is adopted to maintain the information of the semantic vector during training, so that the semantic information in the forward flow can be better used.
For semantic detection, the candidate semantic representation should be obtained first. There are some state-of-the-art methods, such as keyword-based representation [29] and semantic-based representation [30]. Our ground-truth labels are short sentences, and the sentences of a video might be composed of different words. Moreover, because the vocabulary is not large, some videos cannot be covered by the semantic representation if automatic semantic representation methods are used. To avoid this, we build the semantic representation tags by manually selecting the K most common words in the training set. Manual selection is not feasible if the dataset is extremely large, in which case the semantic representation tags should be built with semantic representation methods such as TF-IDF [29]. The task of semantic detection can be seen as a multi-label classification task. For more accurate classification, the fully connected features of the 3D convolutional network and the 2D convolutional network are extracted and concatenated as the input of the SDN. Suppose the representation of the i-th video is $F_i$; then the ground truth is $y_i = \{y_{i,1}, y_{i,2}, \ldots, y_{i,K}\} \in \{0, 1\}^K$. If the j-th tag exists in the annotations of the i-th video, then $y_{i,j} = 1$; otherwise $y_{i,j} = 0$. A semantic detection network can be implemented by a deep multi-layer perceptron (MLP), which can be defined as:
$$s_i = \sigma\big(\mathrm{MLP}(F_i)\big),$$
where $\sigma$ is the sigmoid activation function and $s_i$ is the semantic vector of the i-th video. The loss function of the SDN is defined as:
$$Loss_{sdn} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} \Big( y_{i,j} \log s_{i,j} + (1 - y_{i,j}) \log(1 - s_{i,j}) \Big),$$
where N is the number of training videos. As for the decoder, SCN-LSTM avoids the problem of long-term dependency and embeds the semantic attributes of the input video. Suppose we have the encoded feature $x_t$ at time step t, the hidden state $h_{t-1}$ at time step t − 1, and the semantic feature s.
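Following SCN [11], the input and the hidden state are first modulated by the semantic feature for each $g \in \{c, i, f, o\}$:
$$\tilde{x}_{g,t} = W_{g,b}\, x_t \odot W_{g,c}\, s, \qquad \tilde{h}_{g,t-1} = U_{g,b}\, h_{t-1} \odot U_{g,c}\, s,$$
where c denotes the cell state, i denotes the input gate, f denotes the forget gate, and o denotes the output gate. W and U are learnable weight matrices. The SCN-LSTM can be described as:
$$i_t = \sigma\big(W_{i,a}\, \tilde{x}_{i,t} + U_{i,a}\, \tilde{h}_{i,t-1} + b_i\big),$$
$$f_t = \sigma\big(W_{f,a}\, \tilde{x}_{f,t} + U_{f,a}\, \tilde{h}_{f,t-1} + b_f\big),$$
$$o_t = \sigma\big(W_{o,a}\, \tilde{x}_{o,t} + U_{o,a}\, \tilde{h}_{o,t-1} + b_o\big),$$
$$\tilde{c}_t = \tanh\big(W_{c,a}\, \tilde{x}_{c,t} + U_{c,a}\, \tilde{h}_{c,t-1} + b_c\big),$$
$$c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1},$$
$$h_t = o_t \odot \tanh(c_t),$$
where $\sigma$ denotes the sigmoid function and b is a bias for each gate. The semantic vector for each video is a 1-dimensional vector of size 300 (as in [11]). It is much smaller than the video features, so fewer parameters interact with the semantic vector, and the semantic vector might lose its impact on the result because of a few badly trained parameters. To make better use of semantic information, we propose a semantic reconstruction network to rebuild the semantic vector that is input to the decoder. If the semantic information is used effectively, the reconstructed semantic vector will be more complete. As shown in Figure 3, an LSTM reconstructor is implemented to decode the semantic information, and the outputs of all time steps are then average-pooled to compute the reconstructed semantic vector:
$$s_r = \varphi(h_1, h_2, \ldots, h_n),$$
where $h_i$ is the output of the LSTM at the i-th time step, n is the number of time steps, $\varphi(\cdot)$ is an average pooling process, and $s_r$ represents the reconstructed semantic vector. The reconstruction loss is measured by the Mean Square Error (MSE) loss:
$$Loss_{rec}(s, s_r) = \mathrm{MSE}(s, s_r).$$
Figure 3. An illustration of the proposed semantic reconstructor module.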
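The pooling and loss steps of the reconstructor can be sketched in a few lines of NumPy. This is an illustration only: the reconstructor LSTM itself is not modeled, its per-step outputs are assumed to already have the same size as the semantic vector, and the function names are hypothetical.

```python
import numpy as np

def reconstruct_semantic(h):
    """Average pooling over time.

    h: (n, K) outputs of the reconstructor LSTM, one row per time step
    (assumed to have the same width K as the semantic vector).
    Returns the reconstructed semantic vector of shape (K,).
    """
    return h.mean(axis=0)

def reconstruction_loss(s, s_r):
    """Mean squared error between the semantic vector and its reconstruction."""
    return float(np.mean((s - s_r) ** 2))

# Toy usage with two time steps and a 2-dimensional semantic vector.
h = np.array([[0.0, 2.0], [2.0, 4.0]])
s = np.array([1.0, 1.0])
s_r = reconstruct_semantic(h)         # element-wise mean of the two rows
loss = reconstruction_loss(s, s_r)
```

Because the pooled vector depends on every decoder time step, the MSE gradient reaches all of them, which is one way to read the claim that the reconstructor keeps the semantic information in use throughout decoding.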

Training
Formally, the encoder-decoder module and the reconstructor module are trained together. To generate the correct captions, the encoder-decoder module is trained by minimizing the negative log-likelihood. The loss of the encoder-decoder module can be defined as follows:
$$Loss_{dec}(v, \theta) = -\sum_{t=1}^{T} \log p\big(y_t \mid y_{<t}, v; \theta\big),$$
where $y_t$ denotes the word generated at time t, $v$ is the encoded video features and semantic vector, and $\theta$ represents the parameters to be optimized. The total loss involving the encoder-decoder and the reconstructor can be defined as:
$$Loss = Loss_{dec}(v, \theta) + \lambda\, Loss_{rec}(s, s_r), \tag{26}$$
where $\lambda$ is a hyper-parameter that trades off the encoder-decoder module against the reconstructor module.
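How the two terms combine can be sketched as follows. This is a minimal NumPy illustration that assumes the decoder's per-step word distributions are already available and that the negative log-likelihood is summed over time steps; `caption_nll` and `total_loss` are hypothetical names, and the default `lam=0.2` is the value the experiments section reports choosing.

```python
import numpy as np

def caption_nll(probs, targets):
    """Negative log-likelihood of a caption.

    probs: (T, V) predicted word distributions, one row per time step.
    targets: (T,) indices of the ground-truth words.
    """
    picked = probs[np.arange(len(targets)), targets]   # p(y_t | ...) per step
    return float(-np.log(picked).sum())

def total_loss(probs, targets, s, s_r, lam=0.2):
    """Caption loss plus lambda times the MSE reconstruction loss."""
    rec = float(np.mean((s - s_r) ** 2))
    return caption_nll(probs, targets) + lam * rec

# Toy usage: two decoding steps over a two-word vocabulary.
probs = np.array([[0.5, 0.5], [0.25, 0.75]])
targets = np.array([0, 1])
loss = total_loss(probs, targets, np.array([1.0, 1.0]), np.array([1.0, 3.0]))
```

Setting `lam` to 0 recovers plain encoder-decoder training, which makes the effect of the reconstructor easy to ablate.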

Experimental Evaluation
Our model is evaluated by using two benchmark datasets of video description, including the MSVD (Microsoft Video Description) dataset [12] and MSR-VTT (MSR-Video To Text) dataset [13]. We briefly introduce these datasets and the evaluation metrics in our experiments first, then show the experimental details and results.

Datasets
MSVD is composed of 1970 open-domain YouTube videos and is popular for video captioning. Each clip in this dataset shows a single activity and generally lasts 10 to 25 s. The ground-truth captions are annotated by multiple humans. On average, each clip has approximately 40 ground-truth captions in English. For benchmarking, the training, validation, and testing splits are set to 1200, 100, and 670 clips, respectively, following [31].
MSR-VTT is a large-scale benchmark dataset which consists of 10,000 clips that are transformed from 7180 videos. The clips were divided into 20 different categories. In addition, each video clip is annotated by Amazon Mechanical Turk (AMT) workers using 20 single sentences. Following the official evaluation protocol provided by [13], the training, validation, and testing splits include 6513, 497, and 2990 clips, respectively.

Evaluation Metrics
To compare our method objectively with existing methods, we evaluate our models with four popular metrics: METEOR (Metric for Evaluation of Translation with Explicit Ordering) [32], BLEU (Bilingual Evaluation Understudy) [33], ROUGE-L (Recall Oriented Understudy of Gisting Evaluation) [34] and CIDEr (Consensus-based Image Description Evaluation) [35]. The results can be computed using the Microsoft COCO server [36]. The higher the scores on these metrics, the higher the quality of the generated captions.

Implementation Details
Each dataset contains a small number of captions with too few or too many words, which is bad for training. Before training, we removed captions with more than 30 or fewer than 4 words. In addition, we build a semantic vocabulary by manual selection of the most common words in the dataset. It contains the keywords of a sentence, such as "man", "boy", "playing", and "riding". The vocabulary is composed entirely of "subjects", "verbs" and "objects", which can represent the main idea of the video. Our experiments are based on the MSVD and MSR-VTT datasets. Because of the low complexity of these datasets, the size of the semantic vocabulary is set to 300 (as in [11]).
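The caption filtering and tag construction can be sketched in plain Python. This is an approximation under stated assumptions: the paper selects tags manually, whereas the sketch below ranks words by frequency after removing a caller-supplied stopword list, and all function names are hypothetical.

```python
from collections import Counter

def filter_captions(captions, min_len=4, max_len=30):
    """Keep captions whose word count lies in [min_len, max_len]."""
    return [c for c in captions if min_len <= len(c.split()) <= max_len]

def build_tags(captions, k, stopwords=frozenset()):
    """Return the k most frequent non-stopword words as semantic tags.

    Frequency ranking is a stand-in for the manual selection described
    in the text; the stopword list is an assumption of this sketch.
    """
    counts = Counter(
        w for c in captions for w in c.lower().split() if w not in stopwords
    )
    return [w for w, _ in counts.most_common(k)]

def multi_hot(video_captions, tags):
    """Ground-truth vector: 1 if the tag occurs in any caption of the video."""
    words = {w for c in video_captions for w in c.lower().split()}
    return [1 if t in words else 0 for t in tags]

# Toy usage
caps = ["a man is riding a horse", "a man is playing", "hi",
        "a boy is riding a bike today"]
kept = filter_captions(caps)                         # drops the one-word caption
tags = build_tags(kept, 2, stopwords={"a", "is"})
label = multi_hot(["a man is riding a horse"], ["man", "riding", "boy"])
```

In practice `k` would be 300, and the resulting multi-hot vectors serve as targets for the multi-label semantic detection network.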
For semantic detection, 2D and 3D video features need to be extracted for training the semantic detection network (SDN). The image representation we use is the 1536-way output of the last AvgPool2d layer of InceptionResNetV2 [37] pretrained on the ImageNet dataset [19]. The video representation is extracted by a 3D CNN (C3D) [38] pretrained on the Sports-1M video dataset [39]; it is a 4096-way vector obtained from the fc7 layer of C3D. We uniformly extract 40 frames from each video and take the RGB video frames as input. Video frames are resized to 112 × 112 for C3D and to 299 × 299 for InceptionResNetV2. The input of C3D is video clips of 16 frames (as in [39]) with an overlap of 8 frames. We train the semantic detector using the procedure described in the model architecture. For sequential VLAD encoding, we take the output of the Conv2d_7b layer of InceptionResNetV2, which has shape 1536 × 8 × 8.
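The frame sampling and clip windowing above amount to simple index arithmetic, sketched below. The helper names are hypothetical, and the sketch assumes the 16-frame clips slide over a frame sequence with a stride of 16 − 8 = 8; whether the clips are cut from the 40 sampled frames or from the raw video is not stated in the text.

```python
import numpy as np

def sample_frame_indices(num_frames, num_samples=40):
    """Uniformly spaced frame indices for a video with num_frames frames."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

def clip_starts(num_frames, clip_len=16, overlap=8):
    """Start indices of fixed-length clips with the given overlap."""
    stride = clip_len - overlap
    return list(range(0, num_frames - clip_len + 1, stride))

# Toy usage: a 400-frame video sampled to 40 frames, then cut into clips.
idx = sample_frame_indices(400)
starts = clip_starts(len(idx))   # windows [0, 16), [8, 24), [16, 32), [24, 40)
```

Under this reading, 40 sampled frames yield four overlapping C3D clips per video.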
For model training, there are some well-known word embedding methods, such as GloVe [40], FastText [41] and word2vec [42]. The end-to-end video captioning model proposed in this paper can be learned online, but GloVe is a count-based method that needs overall statistics. FastText uses n-grams and subwords to consider the order of words, so it would build a large corpus. The last linear layer in our model, which outputs vectors with the size of the corpus, takes up nearly half of the parameters. Using FastText would double the number of parameters, so we initialize the word embedding vectors with the publicly available word2vec vectors. In the decoder, the sizes of the word embedding, the feature embedding, and the hidden state of the semantic LSTM are set to 512. Because sentences vary in length, <BOS> and <EOS> are added to every sentence to indicate its beginning and end. To avoid overfitting, the dropout value is set to 0.5. The batch size during training is 64. The model is optimized by Adam [43] with a learning rate of $2 \times 10^{-4}$. We empirically train our model for 50 epochs.
For testing, we employ a beam search algorithm to generate sentences. Beam search selects the k highest-probability words at each time step, then inputs these k words into the next time step to generate the new k highest-probability words. The beam size is set to 5 in our experiments.
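Beam search as described above can be sketched in plain Python. This is a generic illustration, not the authors' decoder: `step_fn` is a hypothetical interface that maps a token prefix to a dictionary of next-token probabilities, hypotheses are scored by cumulative log-probability, and no length normalization is applied.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=5, max_len=30):
    """Return the highest-scoring token sequence found by beam search."""
    beams = [([bos], 0.0)]            # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in step_fn(tokens).items():
                if p > 0.0:
                    candidates.append((tokens + [tok], score + math.log(p)))
        if not candidates:
            break
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)            # fall back to unfinished hypotheses
    return max(finished, key=lambda c: c[1])[0]

def toy_model(prefix):
    """Toy next-token table in which the greedy first choice is suboptimal."""
    table = {
        "<BOS>": {"a": 0.6, "b": 0.4},
        "a": {"x": 0.5, "y": 0.5},
        "b": {"z": 1.0},
    }
    return table.get(prefix[-1], {"<EOS>": 1.0})

wide = beam_search(toy_model, "<BOS>", "<EOS>", beam_size=2)
greedy = beam_search(toy_model, "<BOS>", "<EOS>", beam_size=1)
```

In the toy model, a beam of size 1 commits to "a" and ends with sequence probability 0.3, while a beam of size 2 keeps "b" alive and finds the sequence with probability 0.4.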

Results and Analysis
We first determined the configuration of the modules through experiments, and then evaluated the best-performing model on the datasets.

Study on Modules
The channel soft attention can take different structures depending on the channel attention module used. We combine soft attention with the channel attention part of SEnet [28] and of CBAM [7], respectively. These two kinds of channel attention are quite similar, and both have 200 k parameters. We evaluated the two original attention modules and the variants in which the sigmoid activation function is replaced with tanh. The results are shown in Table 1. SeqVLAD is used as the baseline and is denoted by "SVLAD"; "SE" and "CA" denote the original SEnet and CBAM (channel attention part); "SE − tanh" and "CA − tanh" denote the attention modules with the tanh activation function. As Table 1 shows, the results of SeqVLAD optimized with the different channel soft attention modules are very close. This is because the difference between the two channel attention modules is small: CBAM has one more max-pooling operation than SEnet, and the performance of CBAM is not much better than that of SEnet in the classification task. The tanh activation function brings only a limited improvement, but it works well in this model. All models with attention improve significantly over the baseline, which shows the effectiveness of channel soft attention.
As for the semantic reconstructor, the parameter λ in Equation (26) controls how much reconstruction loss is added in training; too large a λ may ruin the caption performance. To select the best λ, we ran comparative experiments with different values of λ on two structures. As shown in Figure 4, the comparison on BLEU-4 shows that larger is not better. We empirically set λ to 0.2 in the following experiments.
The parameters and multiply-accumulate operations (MACs) of each module are shown in Table 2. The MACs of each module are counted for one video. "CA" denotes the channel attention module, "SDN" denotes the semantic detection module, and "CSA-SR" denotes the final model in our paper.
As Table 2 shows, the Decoder module accounts for more than half of the parameters and MACs of "CSA-SR". The summed parameters and MACs of the modules do not equal those of the final model "CSA-SR" because some extra operations are needed to connect these modules.

Results on MSVD Dataset
Before comparing our method with other methods, we ran comparative experiments to verify its effectiveness, and we report the results of the different combinations of our method in Table 3. We first evaluate the original SeqVLAD method and its combination with channel soft attention, the semantic LSTM decoder, and the semantic reconstructor, where "SVLAD" denotes SeqVLAD, "CSA" denotes the channel soft attention, "SLSTM" denotes the semantic LSTM to which semantic information has been added, and "SR" denotes the semantic LSTM with the reconstructor. The number of cluster centers is set to 32 because of the good overall performance shown in [6]. As Table 3 shows, channel soft attention alone ($\mathrm{SVLAD}_{\mathrm{CSA}}$) yields a significant improvement of 3.85% in METEOR. The semantic LSTM alone ($\mathrm{SVLAD}_{\mathrm{SLSTM}}$) and the semantic reconstructor alone ($\mathrm{SVLAD}_{\mathrm{SR}}$) gain 2.34% and 2.96% over the original SeqVLAD, respectively. The model improved by channel soft attention has a higher CIDEr score than the others, which shows that channel soft attention brings the larger improvement. However, the semantic reconstructor obtains a higher BLEU-4 score. This is because the semantic reconstructor fully uses the semantic tags, which indicate the most common words in the dataset; the more words in the generated captions match, the higher the BLEU-4 score. Channel soft attention combined with the semantic LSTM decoder ($\mathrm{SVLAD}_{\mathrm{CSA-SLSTM}}$) does not improve much over $\mathrm{SVLAD}_{\mathrm{CSA}}$. Channel soft attention with the semantic reconstructor ($\mathrm{SVLAD}_{\mathrm{CSA-SR}}$) shows the best overall performance. Our method is then compared with some state-of-the-art methods. The main ideas of the methods in Table 4 are introduced as follows:
1. S2VT [20] is an encoder-decoder framework with two layers of LSTM, where the first layer encodes the video representation and the second generates video descriptions.
2. SA [21], which is based on temporal attention, focuses on both the local and global temporal structure of the video and selects the most relevant temporal segments automatically.
3. h-RNN [44] introduces a hierarchical structure with a sentence generator and a paragraph generator, both of which are recurrent neural networks.
4. HRNE [45] is a hierarchical recurrent neural encoder with a soft attention mechanism that explores the temporal transitions between frame chunks with different granularities.
5. SCN-LSTM [11] detects semantic tags before training caption models and uses semantic compositional networks to embed semantic information for caption generation.
6. PickNet [46] selects the useful part of the video frames for the encoder-decoder framework. It aims to reduce the computation and the noise caused by redundant frames.
7. TDDF [47] uses a task-driven dynamic fusion mechanism to reduce the ambiguity in the video description and refines the degree of the description of the video content.
8. RecNet [24] is a dual learning model that has a forward flow and a backward flow during training. The video representation is reconstructed using the output of the decoder.
9. TDConvED [48] constructs both the encoder and the decoder entirely from convolutions.
10. GRU-EVE [49] mainly applies the Short Fourier Transform to CNN features, which enriches the visual features with temporal dynamics.
We compare our model with the baseline and the existing models above on the MSVD dataset. As Table 4 shows, with SeqVLAD as the baseline, adding only the semantic reconstructor (SR) or only channel soft attention (CSA) gives a clear improvement in METEOR, BLEU-4, and CIDEr. Moreover, our final model (CSA-SR) achieves a CIDEr score of 83.4, which is 3.86% higher than that of the closest competitor, RecNet [24]. Its METEOR score is only 1.71% higher than that of GRU-EVE [49], possibly because GRU-EVE exploits object information, which may be effective in improving performance. On the other metrics, our method remains competitive with the other methods.

Figure 5 shows some examples of captions generated on the MSVD dataset. We divide them into three categories: correct captions, relevant but incorrect captions, and irrelevant captions. Generally, the correct captions are very close to the ground truth. For instance, our model generates "a man is pouring some ingredients into a bowl" for the third video, which describes the corresponding video content in detail. The relevant but incorrect captions mainly contain two kinds of mistake: incorrect verbs and incorrect nouns. For example, in the first relevant video, "a man is jumping on a horse" has the incorrect verb "jumping on", which should be replaced by "riding". The same kind of mistake appears in the second relevant video: "cutting" in "a man is cutting a bread" should be replaced by "buttering". In the third relevant video, "a man is riding a horse" contains the wrong noun "horse", which should be replaced by "motorcycle". These mistakes might be due to a lack of training samples. Although our proposed model achieves satisfactory results, the existence of irrelevant captions shows that it does not always generate an appropriate description.
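To illustrate why a single wrong verb phrase in a relevant but incorrect caption can lower n-gram metrics such as BLEU-4, the following minimal sketch computes modified n-gram precision, one component of BLEU, for the first relevant example above. It assumes the corrected caption "a man is riding a horse" as a stand-in reference (the actual ground-truth sentences are not listed here), and it omits the brevity penalty and the geometric mean of the full metric; the function names are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate caption against one reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


generated = "a man is jumping on a horse"
# Assumption: the corrected caption is used as a hypothetical reference.
reference = "a man is riding a horse"

precisions = [modified_precision(generated, reference, n) for n in range(1, 5)]
for n, p in enumerate(precisions, start=1):
    print(f"{n}-gram precision: {p:.3f}")
```

In this toy case most unigrams still match, but no 4-gram survives the wrong verb phrase, so an unsmoothed sentence-level BLEU-4 against this single reference would be zero.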
Our method still has considerable room for improvement. In the sentence generation part, the quality of the sentences may be improved by other learning strategies such as sequence-level reinforcement learning.
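The percentage comparisons quoted for Table 4 in this subsection can be read as relative gains. The sketch below assumes that reading and back-computes the competitor's CIDEr score implied by the two figures given in the text (83.4 and 3.86%); the helper names are hypothetical, and the implied value is a derived check rather than a number taken from Table 4.

```python
def relative_gain(ours: float, reference: float) -> float:
    """Relative improvement of `ours` over `reference`, in percent."""
    if reference == 0:
        raise ValueError("reference score must be non-zero")
    return (ours - reference) / reference * 100.0


def implied_reference(ours: float, gain_percent: float) -> float:
    """Reference score implied by our score and a quoted relative gain."""
    return ours / (1.0 + gain_percent / 100.0)


ours_cider = 83.4    # CIDEr of CSA-SR on MSVD, as stated in the text
quoted_gain = 3.86   # quoted gain over the closest competitor, in percent

# Assumption: the quoted percentage is a relative (not absolute) difference.
ref_cider = implied_reference(ours_cider, quoted_gain)
print(round(ref_cider, 1))
print(round(relative_gain(ours_cider, ref_cider), 2))
```

Under the alternative reading, where the percentage is an absolute difference in metric points, the implied competitor score would simply be 83.4 minus 3.86.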

Results on MSR-VTT Dataset
We compare our method with recent models on the MSR-VTT dataset, namely RecNet [24], TDDF [47], RUC-UVA [50], Alto [51], PickNet [46], TDConvED [48], and GRU-EVE [49]. The results are summarized in Table 5, which shows that our method is competitive with the state-of-the-art methods on MSR-VTT across all four metrics. These results confirm the effectiveness of the proposed channel soft attention and semantic reconstructor for video captioning.

User Study
Automatic sentence evaluation metrics may not fully reflect the quality of generated sentences. To evaluate the model with human judgment, we conducted a user study. Ten people evaluated the generated sentences of two testing subsets, each consisting of 20 samples randomly selected from the MSVD and MSR-VTT datasets, respectively. The test sentences are generated by our final model (CSA-SR). All users watch the same 20 videos of each testing subset and, for each video, select the more appropriate sentence from our generated sentence and a randomly selected ground-truth sentence. In Table 6, "pred" denotes the number of generated sentences chosen by users, "gt" denotes the number of ground-truth sentences chosen, and "rate" denotes the proportion of generated sentences among all choices. A higher "rate" indicates that the generated sentences are more appropriate from the users' perspective. The average rate shows that the generated sentences are of competitive quality, especially on the MSVD dataset.
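As a small illustration of how the "rate" column can be read, the sketch below computes it as pred / (pred + gt). It assumes that each of the 10 users judges each of the 20 videos once, giving 200 choices per subset; the counts used are hypothetical and are not the values reported in Table 6.

```python
def selection_rate(pred: int, gt: int) -> float:
    """Proportion of choices in which the generated sentence was preferred."""
    total = pred + gt
    if total == 0:
        raise ValueError("at least one choice is required")
    return pred / total


users, videos = 10, 20
# Assumption: every user makes exactly one choice per video.
total_choices = users * videos

# Hypothetical counts for one subset, for illustration only.
pred = 90
gt = total_choices - pred

rate = selection_rate(pred, gt)
print(f"pred={pred}, gt={gt}, rate={rate:.2f}")
```

Because the ground-truth sentence is written by a human, a rate near 0.5 would mean users found the generated sentence as appropriate as the human one about half of the time.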

Conclusions
Existing convolutional RNNs for video feature extraction exploit the video's global temporal structure for fully connected features but do not consider the global temporal structure of each channel, and the semantic information is not fully used. In this paper, a video captioning model based on channel soft attention and a semantic reconstructor is proposed. Specifically, soft attention is combined with channel attention to compute a weighted version of every channel at each time step based on the global feature maps. A semantic reconstructor is introduced so that the semantic information is fully used. Comparisons with state-of-the-art methods on two popular benchmarks, MSVD and MSR-VTT, show that our method achieves competitive or even superior performance. In future work, we will explore more effective approaches, such as reinforcement learning, to optimize sentence-level metrics for more interpretable sentences.