Att-BiL-SL: Attention-Based Bi-LSTM and Sequential LSTM for Describing Video in the Textual Formation
Abstract
1. Introduction
2. Related Work
2.1. Template Matching
2.2. End-to-End Model
3. Proposed Methodology
3.1. Feature Extraction
3.2. Encoder-Decoder with Bi-LSTM and Sequential LSTM
3.3. Attention Mechanism
3.4. Sentence Generation
4. Experiment
4.1. Datasets
- (i) MSVD [41]: The dataset contains 1970 YouTube video clips, with approximately 80 k sentences and a vocabulary size of 13,010. The total duration of the dataset is about 5.3 h. The dataset was split into 1200 clips for training, 100 clips for validation, and the remaining 670 clips for testing.
- (ii) MSR-VTT [35]: One of the largest video description datasets, it comprises 10 k video clips with a total duration of 41.2 h. It covers 20 distinct categories, such as sports, music, and gaming; the typical clip length is around 20 s, and the vocabulary contains 29,316 words. The dataset has 6513 training clips, 497 validation clips, and 2990 testing clips (a minimal split configuration is sketched after this list).
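A minimal sketch of the train/validation/test partition quoted above, written as a plain configuration dictionary; the dictionary and key names are ours and are used only to make the split counts explicit.

```python
# Hedged sketch (not from the paper's code): the dataset splits quoted above.
# Dictionary and key names are illustrative.
DATASET_SPLITS = {
    "MSVD": {            # 1970 YouTube clips, ~80 k sentences, vocab 13,010
        "train": 1200,
        "val": 100,
        "test": 670,     # the remaining clips
    },
    "MSR-VTT": {         # 10 k clips, 41.2 h total, 20 categories, vocab 29,316
        "train": 6513,
        "val": 497,
        "test": 2990,
    },
}

# Sanity checks against the clip counts stated above.
assert sum(DATASET_SPLITS["MSVD"].values()) == 1970
assert sum(DATASET_SPLITS["MSR-VTT"].values()) == 10000
```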
4.2. Evaluation Metrics
- (i) BLEU [37]: One of the most widely used metrics for assessing machine-generated sentences, it measures the correspondence between machine-generated sentences and ground-truth sentences and tends to give its best results for short sentences. BLEU-N (N = 1, 2, 3, 4) evaluates the precision of N-gram matches in a sentence (a minimal BLEU computation is sketched after this list).
- (ii) METEOR [38]: Another commonly used metric for evaluating machine-generated sentences, it is based on unigram precision and recall and includes synonym matching, stemming, and exact word matching. It is more robust than the other metrics with respect to human judgment.
- (iii) CIDEr [39]: This metric is primarily used for image captioning; since a video is essentially a sequence of continuous images, it applies to video captioning as well. It measures the consensus between machine-generated sentences and human-annotated sentences.
- (iv) ROUGE [40]: ROUGE is another evaluation metric, based on n-gram co-occurrence. It has four variants, of which ROUGE-L is the one generally used for visual captioning evaluation.
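As a concrete illustration of the first metric, the following minimal sketch computes corpus-level BLEU-1 through BLEU-4 with NLTK (the toolkit this work already uses for tokenization). The example captions are invented; METEOR, CIDEr, and ROUGE-L would typically be computed with a dedicated toolkit such as the coco-caption evaluation code.

```python
# Minimal sketch: BLEU-1..4 with NLTK. Example sentences are invented;
# whitespace splitting stands in for the NLTK word tokenizer used in the paper.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [  # one list of ground-truth captions per video clip
    ["a man is playing a guitar".split(),
     "a person plays the guitar".split()],
]
hypotheses = ["a man plays a guitar".split()]  # one generated caption per clip

smooth = SmoothingFunction().method1  # avoids zero scores for short sentences
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform n-gram weights for BLEU-n
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```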
4.3. Implementation Details
- (i) Hardware setup: The experiments were performed on a local personal computer with an AMD Ryzen 3600X six-core processor (3.80–4.4 GHz) and 16 GB of DDR4 RAM (3200 MHz). For faster computation, a single Nvidia RTX 3070 GPU (5888 CUDA cores) with 8 GB of video memory was also used.
- (ii) Preprocessing: For video preprocessing and 2D global temporal feature extraction, 25 frames were sampled from every video. A VGG-16 model [42] pre-trained on the ImageNet dataset [57] was used to extract visual appearance features: the frames were resized to 224 × 224 resolution, and the 4096-dimensional feature vectors produced by the fully connected fc7 layer were taken as the frame representation. For the pre-trained InceptionV3 [47] and Xception [48] models, the frames were resized to 299 × 299 resolution. Moreover, to extract motion features, we used an Inflated 3D network (I3D) pre-trained on the Kinetics [50] dataset, which is very effective for action recognition; the classifier layer was then adjusted in all of the models. We used Faster R-CNN [51] for local features such as face detection. For the sentence portion, we used the NLTK [54] word tokenizer, converted the vocabulary of every dataset to lower case, and removed punctuation from the sentences. The total vocabulary size of the two datasets is approximately 42 k. (A feature-extraction sketch follows this list.)
- (iii) Experimental setup: We propose an attention-based bidirectional LSTM model [53] for video captioning. The model can predict both past and future words and combine them to generate a meaningful sentence, since both forward and backward information is processed to capture the relevant context. The attention-based two-layer bidirectional LSTM [53] has 512 hidden nodes in each layer. The visual content extracted by the proposed CNN models [42,47,48] is fed, through the first layer and an attention mechanism, to the decoder. The decoder, a sequential LSTM [52] with an attention layer, also has 512 hidden nodes in each layer. A dropout layer is used for regularization in our proposed model, with the dropout rate set to 0.5 at both the encoder and the decoder. We used the ADAM optimizer to minimize the loss function and set the initial learning rate so as to avoid gradient explosion. In addition, we used a beam search of size 5 to produce the textual description, and we trained our model with a batch size of 64. (A hedged encoder-decoder sketch follows this list.)
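To make the feature-extraction step concrete, here is a minimal, hedged sketch of extracting the 4096-dimensional VGG-16 fc7 appearance features from 25 sampled frames with TensorFlow/Keras (Keras exposes this layer under the name "fc2"). The `sample_frames` helper, the use of OpenCV, and uniform frame sampling are our own illustrative choices, not details confirmed by the paper.

```python
# Hedged sketch: sample 25 frames per clip and extract 4096-d VGG-16 fc7 features.
import numpy as np
import cv2  # OpenCV, assumed here for frame reading
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

NUM_FRAMES = 25  # frames sampled per video, as stated above

def sample_frames(video_path, num_frames=NUM_FRAMES, size=(224, 224)):
    """Uniformly sample `num_frames` RGB frames and resize them to 224x224."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idxs = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for i in idxs:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame, size))
    cap.release()
    return np.array(frames, dtype=np.float32)

# VGG-16 pre-trained on ImageNet; Keras' "fc2" is the 4096-d layer that
# corresponds to fc7 in the original Caffe naming.
base = VGG16(weights="imagenet", include_top=True)
fc7_extractor = tf.keras.Model(base.input, base.get_layer("fc2").output)

def extract_appearance_features(video_path):
    frames = preprocess_input(sample_frames(video_path))
    return fc7_extractor.predict(frames, verbose=0)  # shape (25, 4096)
```

And here is a hedged sketch of the encoder-decoder configuration from the experimental-setup item: a two-layer bidirectional LSTM encoder over the frame features, an attention layer, and a sequential LSTM decoder, each with 512 hidden units and dropout 0.5, trained with Adam. The exact wiring, the dot-product attention variant, and the vocabulary size, embedding size, caption length, and learning rate are assumptions made only for illustration.

```python
# Hedged sketch of the encoder-decoder configuration described above.
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_FRAMES, FEAT_DIM = 25, 4096        # per-frame CNN features from the extractor above
VOCAB_SIZE, EMBED_DIM, HIDDEN = 10000, 512, 512  # vocabulary size is assumed
MAX_CAPTION_LEN = 20                   # caption length is assumed

# Encoder: two stacked bidirectional LSTM layers, 512 units each, dropout 0.5.
frame_feats = layers.Input(shape=(NUM_FRAMES, FEAT_DIM), name="frame_features")
x = layers.Bidirectional(layers.LSTM(HIDDEN, return_sequences=True))(frame_feats)
x = layers.Dropout(0.5)(x)
enc_seq = layers.Bidirectional(layers.LSTM(HIDDEN, return_sequences=True))(x)
enc_seq = layers.Dropout(0.5)(enc_seq)
enc_proj = layers.Dense(HIDDEN)(enc_seq)  # project 1024-d Bi-LSTM outputs to 512 for attention

# Decoder (teacher forcing): previous words in, next-word probabilities out.
prev_words = layers.Input(shape=(MAX_CAPTION_LEN,), dtype="int32", name="previous_words")
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(prev_words)  # padding mask omitted for brevity
dec_seq = layers.LSTM(HIDDEN, return_sequences=True)(emb)

# Dot-product attention: every decoder step attends over the encoded frame sequence.
context = layers.Attention()([dec_seq, enc_proj])
merged = layers.Dropout(0.5)(layers.Concatenate()([dec_seq, context]))
word_probs = layers.TimeDistributed(layers.Dense(VOCAB_SIZE, activation="softmax"))(merged)

model = Model([frame_feats, prev_words], word_probs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # initial LR assumed
              loss="sparse_categorical_crossentropy")
model.summary()
```

At inference time, the greedy argmax over `word_probs` would be replaced by a beam search of width 5, and training would use a batch size of 64, as stated in the setup above.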
5. Experimental Results and Comparison
5.1. Quantitative Performance Comparison with Baseline Methods
- (i) HATT [9]: The method introduced a two-level LSTM branch in a hierarchical arrangement and handled both low-level and high-level attention. It was semantically enriched by fusing multiple modalities for caption generation, but its feature-extraction time was very high. The method was also evaluated on the two public datasets and scored 52.9% on the B-4 metric on MSVD, higher than our model. Our model, Att-BiL-SL (using its best-scoring variant), attained a better result than HATT, with increases of 1.9% and 13.1% on M and C, respectively, on the MSVD dataset. On the MSR-VTT dataset, our model improved performance by 0.2% and 13.1% on the M and C metrics, respectively (these gains are read from the comparison tables; see the sketch after this list).
- (ii) ICA-LSTM [10]: The model consisted of multiple LSTMs [52] for video captioning and used a dual-stage loss function. The method was also trained with various CNN models. Our model again achieved a better outcome than ICA-LSTM, with gains of 0.6% and 4% on M and C on the MSVD dataset, and improvements of 1% and 4% on M and C on the MSR-VTT dataset.
- (iii) SF-SSAG-LSTM [12]: The method introduced augmented audio features and semantic filtering and generated a single-line video caption. Our method improved by 0.3% and 12% on the M and C metrics on the MSVD dataset, and by 0.4% and 2.5% on M and C on the MSR-VTT dataset.
- (iv) LSTM-GAN [13]: The model introduced a generative adversarial network with an attention mechanism for video captioning. Compared to this model, our method performed better, with gains of 8.3% and 5.3% on the B-4 and M metrics on the MSVD dataset, and improvements of 5.2% and 2.6% on the MSR-VTT dataset.
- (v) DenseLSTM [17]: An attention-based densely connected LSTM was used for video captioning, with backward cells connected to forward cells. This model did not outperform ours: our model improved by 0.8%, 2.8%, and 14.3% on the B-4, M, and C metrics on the MSVD dataset, and by 3.1%, 2.1%, and 6.5% on the same metrics on the MSR-VTT dataset.
- (vi) CAM-RNN [20]: The co-attention model (CAM) encoded the visual and textual features, and the RNN maintained the decoder state to generate the video caption. Our model, Att-BiL-SL (using its best-scoring variant), achieved a better outcome than CAM-RNN, with increments of 8.8%, 2.3%, 32.6%, and 2.7% on B-4, M, C, and R-L, respectively, on the MSVD dataset, and improvements of 6.5%, 0.8%, 10.5%, and 1.6% on the B-4, M, C, and R-L metrics, respectively, on the MSR-VTT dataset.
- (vii) VRE [23]: This method introduced a refocused RNN with spatial visual features for better video captioning. It outperformed our model on the B-4 and R-L metrics, with 43.2% and 62.0% on the MSR-VTT dataset. Compared with this model, our method performed better on the MSVD dataset, with gains of 1.4%, 0.2%, and 0.2% on the M, C, and R-L metrics, and it improved by 0.7% and 1% on M and C on the MSR-VTT dataset.
- (viii) MR-HRNN [25]: The proposed framework recognizes human-related activities involving objects during sports and introduced a pose-attribute detection module and a description-generation module. On the MSVD dataset, our model improved performance by 0.7%, 3%, and 17.3% on the B-4, M, and C metrics; on the MSR-VTT dataset, performance improved by 5%, 3.1%, and 15.8% on B-4, M, and C, respectively.
- (ix) STA-FG-RC [32]: The method uses semantic temporal attention for describing video in textual form and can explicitly combine high-level visual concepts to create temporal attention. Our method performed better than STA-FG-RC, with a gain of 1.2% on the M metric on the MSVD dataset and improvements of 0.4% and 1.3% on B-4 and M on the MSR-VTT dataset.
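As noted in item (i), the gains above are absolute differences between our best variant and each baseline's score in the comparison tables. A tiny sketch using the HATT row of the MSVD table makes the arithmetic explicit; the dictionary below is only an illustrative excerpt of that table.

```python
# Hedged sketch: gains are absolute percentage-point differences between
# Att-BiL-SL (X + I3D) and a baseline, read from the MSVD comparison table.
msvd_scores = {                     # (METEOR, CIDEr) pairs from the MSVD table
    "Att-BiL-SL (X + I3D)": (35.7, 86.9),
    "HATT": (33.8, 73.8),
}
ours = msvd_scores["Att-BiL-SL (X + I3D)"]
hatt = msvd_scores["HATT"]
print(f"METEOR gain: {ours[0] - hatt[0]:.1f}")  # 1.9, as reported
print(f"CIDEr gain:  {ours[1] - hatt[1]:.1f}")  # 13.1, as reported
```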
6. Quantitative Analysis and Discussion
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Cisco. The Zettabyte Era: Trends and Analysis; Cisco: San Jose, CA, USA, 2017; Available online: https://en.wikipedia.org/wiki/Zettabyte_Era (accessed on 3 April 2021).
- Muhammad, K.; Hussain, T.; Baik, S.W. Efficient CNN based summarization of surveillance videos for resource-constrained devices. Pattern Recognit. Lett. 2020, 130, 370–375. [Google Scholar] [CrossRef]
- Sridevi, M.; Kharde, M. Video Summarization Using Highlight Detection and Pairwise Deep Ranking Model. Procedia Comput. Sci. 2020, 167, 1839–1848. [Google Scholar] [CrossRef]
- Chu, Y.-W.; Lin, K.-Y.; Hsu, C.-C.; Ku, L.-W. Multi-Step Joint-Modality Attention Network for Scene-Aware Dialogue System. 2020. Available online: http://arxiv.org/abs/2001.06206 (accessed on 16 September 2020).
- Huang, J.H.; Worring, M. Query-controllable video summarization. In Proceedings of the 2020 International Conference on Multimedia Retrieval, Dublin, Ireland, 8–11 June 2020; pp. 242–250. [Google Scholar] [CrossRef]
- Li, Z.; Li, Z.; Zhang, J.; Feng, Y.; Zhou, J. Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog. In IEEE/ACM Transactions on Audio, Speech, and Language Processing; IEEE: Piscataway, NJ, USA, 2021; Volume 29, pp. 2476–2483. [Google Scholar] [CrossRef]
- Pan, G.; Zheng, Y.; Zhang, R.; Han, Z.; Sun, D.; Qu, X. A bottom-up summarization algorithm for videos in the wild. EURASIP J. Adv. Signal Process. 2019, 2019, 15. [Google Scholar] [CrossRef] [Green Version]
- Nian, F.; Li, T.; Wang, Y.; Wu, X.; Ni, B.; Xu, C. Learning explicit video attributes from mid-level representation for video captioning. Comput. Vis. Image Underst. 2017, 163, 126–138. [Google Scholar] [CrossRef]
- Wu, C.; Wei, Y.; Chu, X.; Weichen, S.; Su, F.; Wang, L. Hierarchical attention-based multimodal fusion for video captioning. Neurocomputing 2018, 315, 362–370. [Google Scholar] [CrossRef]
- Xiao, H.; Xu, J.; Shi, J. Exploring diverse and fine-grained caption for video by incorporating convolutional architecture into LSTM-based model. Pattern Recognit. Lett. 2020, 129, 173–180. [Google Scholar] [CrossRef]
- Jin, T.; Li, Y.; Zhang, Z. Recurrent convolutional video captioning with global and local attention. Neurocomputing 2019, 370, 118–127. [Google Scholar] [CrossRef]
- Xu, Y.; Yang, J.; Mao, K. Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature. Neurocomputing 2019, 357, 24–35. [Google Scholar] [CrossRef]
- Yang, Y.; Zhou, J.; Ai, J.; Bin, Y.; Hanjalic, A.; Shen, H.T.; Ji, Y. Video Captioning by Adversarial LSTM. IEEE Trans. Image Process. 2018, 27, 5600–5611. [Google Scholar] [CrossRef] [Green Version]
- Song, J.; Guo, Y.; Gao, L.; Li, X.; Hanjalic, A.; Shen, H.T. From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3047–3058. [Google Scholar] [CrossRef] [Green Version]
- Gao, L.; Guo, Z.; Zhang, H.; Xu, X.; Shen, H.T. Video Captioning with Attention-Based LSTM and Semantic Consistency. IEEE Trans. Multimedia 2017, 19, 2045–2055. [Google Scholar] [CrossRef]
- Bin, Y.; Yang, Y.; Shen, F.; Xu, X.; Shen, H.T. Bidirectional long-short term memory for video description. In Proceedings of the 24th ACM international conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 436–440. [Google Scholar] [CrossRef] [Green Version]
- Zhu, Y.; Jiang, S. Attention-based densely connected LSTM for video captioning. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 802–810. [Google Scholar] [CrossRef]
- Francis, D.; Huet, B. L-STAP: Learned spatio-temporal adaptive pooling for video captioning. In Proceedings of the 1st International Workshop on AI for Smart TV Content Production, Access and Delivery-AI4TV’19, Nice, France, 21 October 2019; pp. 33–41. [Google Scholar] [CrossRef] [Green Version]
- Xu, N.; Liu, A.-A.; Wong, Y.; Zhang, Y.; Nie, W.; Su, Y.; Kankanhalli, M. Dual-Stream Recurrent Neural Network for Video Captioning. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 2482–2493. [Google Scholar] [CrossRef]
- Zhao, B.; Li, X.; Lu, X. CAM-RNN: Co-Attention Model Based RNN for Video Captioning. IEEE Trans. Image Process. 2019, 28, 5552–5565. [Google Scholar] [CrossRef] [PubMed]
- Xiao, H.; Shi, J. Video Captioning with Adaptive Attention and Mixed Loss Optimization. IEEE Access 2019, 7, 135757–135769. [Google Scholar] [CrossRef]
- Saleem, S.; Dilawari, A.; Khan, U.G.; Iqbal, R.; Wan, S.; Umer, T. Stateful human-centered visual captioning system to aid video surveillance. Comput. Electr. Eng. 2019, 78, 108–119. [Google Scholar] [CrossRef]
- Shi, X.; Cai, J.; Joty, S.; Gu, J. Watch it twice: Video captioning with a refocused video encoder. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 818–826. [Google Scholar] [CrossRef]
- Gui, Y.; Guo, D.; Zhao, Y. Semantic enhanced encoder-decoder network (SEN) for video captioning. In Proceedings of the 2nd Workshop on Multimedia for Accessible Human Computer Interfaces-MAHCI’19, Nice, France, 25 October 2019; pp. 25–32. [Google Scholar] [CrossRef]
- Qi, M.; Wang, Y.; Li, A.; Luo, J. Sports video captioning by attentive motion representation based hierarchical recurrent neural networks. In Proceedings of the 1st International Workshop on Multimedia Content Analysis in Sports, Seoul, Korea, 26 October 2018; pp. 77–85. [Google Scholar] [CrossRef]
- Liu, S.; Ren, Z.; Yuan, J. SibNet: Sibling convolutional encoder for video captioning. In Proceedings of the 26th ACM international conference on Multimedia, Seoul, Korea, 22–26 October 2018; Volume 2, pp. 1425–1434. [Google Scholar] [CrossRef]
- Xiao, H.; Shi, J. Video captioning using hierarchical multi-attention model. In Proceedings of the 2nd International Conference on Advances in Image Processing-ICAIP ’18, Chengdu, China, 16–18 June 2018; pp. 96–101. [Google Scholar] [CrossRef]
- Phan, S.; Miyao, Y.; Satoh, S. MANet: A modal attention network for describing videos. In Proceedings of the 25th ACM international conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1889–1894. [Google Scholar] [CrossRef]
- Chen, B.-C.; Chen, Y.-Y.; Chen, F. Video to text summary: Joint video summarization and captioning with recurrent neural networks. In Proceedings of the 2017 British Machine Vision Conference, London, UK, 4–7 September 2017; pp. 1–14. [Google Scholar] [CrossRef] [Green Version]
- Sah, S.; Kulhare, S.; Gray, A.; Venugopalan, S.; Prud’Hommeaux, E.; Ptucha, R. Semantic text summarization of long videos. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 989–997. [Google Scholar] [CrossRef]
- Dilawari, A.; Khan, M.U.G. ASoVS: Abstractive Summarization of Video Sequences. IEEE Access 2019, 7, 29253–29263. [Google Scholar] [CrossRef]
- Gao, L.; Wang, X.; Song, J.; Liu, Y. Fused GRU with semantic-temporal attention for video captioning. Neurocomputing 2020, 395, 222–228. [Google Scholar] [CrossRef]
- Hao, X.; Zhou, F.; Li, X. Scene-Edge GRU for Video Caption. In Proceedings of the 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 12–14 June 2020; pp. 1290–1295. [Google Scholar] [CrossRef]
- Cherian, A.; Wang, J.; Hori, C.; Marks, T.K. Spatio-temporal ranked-attention networks for video captioning. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020; pp. 1606–1615. [Google Scholar] [CrossRef]
- Xu, J.; Mei, T.; Yao, T.; Rui, Y. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5288–5296. [Google Scholar]
- Karpathy, A.; Fei-Fei, L. Deep Visual-Semantic Alignments for Generating Image Descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 664–676. [Google Scholar] [CrossRef] [Green Version]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL ’02), Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
- Lavie, A.; Agarwal, A. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, 23 June 2007; pp. 228–231. Available online: http://acl.ldc.upenn.edu/W/W05/W05-09.pdf#page=75 (accessed on 8 October 2020).
- Vedantam, R.; Zitnick, C.L.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar] [CrossRef] [Green Version]
- Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, 25–26 July 2004. [Google Scholar]
- Chen, D.; Dolan, W. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, Portland, OR, USA, 19–24 June 2011; Volume 1, pp. 190–200. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.C.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv 2015, arXiv:1502.03044. [Google Scholar]
- Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R.K.; Deng, L.; Dollar, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J.C.; et al. From captions to visual concepts and back. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1473–1482. [Google Scholar]
- Guadarrama, S.; Krishnamoorthy, N.; Malkarnenkar, G.; Venugopalan, S.; Mooney, R.; Darrell, T.; Saenko, K. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 2712–2719. [Google Scholar] [CrossRef]
- Park, J.; Song, C.; Han, J.-H. A study of evaluation metrics and datasets for video captioning. In Proceedings of the 2017 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Okinawa, Japan, 24–26 November 2017; pp. 172–175. [Google Scholar] [CrossRef]
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
- Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar]
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
- Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4724–4733. [Google Scholar] [CrossRef] [Green Version]
- Jiang, H.; Learned-Miller, E. Face Detection with the Faster R-CNN. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; pp. 650–657. [Google Scholar] [CrossRef] [Green Version]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; Xu, B. Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 2, pp. 207–212. [Google Scholar]
- Natural Language Toolkit—NLTK 3.5 Documentation. Available online: https://www.nltk.org/ (accessed on 1 April 2021).
- TensorFlow. Available online: https://www.tensorflow.org/ (accessed on 1 April 2021).
- Keras Applications. Available online: https://keras.io/api/applications/ (accessed on 1 April 2021).
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
- Krishnamoorthy, N.; Malkarnenkar, G.; Mooney, R.; Saenko, K.; Guadarrama, S. Generating natural-language video descriptions using text-mined knowledge. In Proceedings of the Workshop on Vision and Natural Language Processing, Atlanta, GA, USA, 14 June 2013; pp. 10–19. Available online: http://www.aclweb.org/anthology/W13-1302 (accessed on 22 January 2021).
- Guo, J.; Liu, H.; Li, X.; Xu, D.; Zhang, Y. An Attention Enhanced Spatial–Temporal Graph Convolutional LSTM Network for Action Recognition in Karate. Appl. Sci. 2021, 11, 8641. [Google Scholar] [CrossRef]
- Peng, L.; Zhu, Q.; Lv, S.-X.; Wang, L. Effective long short-term memory with fruit fly optimization algorithm for time series forecasting. Soft Comput. 2020, 24, 15059–15079. [Google Scholar] [CrossRef]
- Peng, L.; Wang, L.; Xia, D.; Gao, Q. Effective energy consumption forecasting using empirical wavelet transform and long short-term memory. Energy 2021, 238, 121756. [Google Scholar] [CrossRef]
- Qin, Y.; Xiang, S.; Chai, Y.; Chen, H. Macroscopic–Microscopic Attention in LSTM Networks Based on Fusion Features for Gear Remaining Life Prediction. IEEE Trans. Ind. Electron. 2019, 67, 10865–10875. [Google Scholar] [CrossRef]
Performance of the proposed Att-BiL-SL variants on the MSVD and MSR-VTT datasets (V = VGG-16, IV3 = InceptionV3, X = Xception; B-4 = BLEU-4, M = METEOR, C = CIDEr, R-L = ROUGE-L).

| Model | MSVD B-4 | MSVD M | MSVD C | MSVD R-L | MSR-VTT B-4 | MSR-VTT M | MSR-VTT C | MSR-VTT R-L |
|---|---|---|---|---|---|---|---|---|
| Att-BiL-SL (V + I3D) | 48.3 | 32.5 | 85.3 | 68.8 | 39.0 | 27.8 | 46.2 | 57.8 |
| Att-BiL-SL (IV3 + I3D) | 49.1 | 33.2 | 86.4 | 70.3 | 40.8 | 28.3 | 48.0 | 59.8 |
| Att-BiL-SL (X + I3D) | 51.2 | 35.7 | 86.9 | 72.1 | 41.2 | 28.7 | 49.3 | 60.4 |
Quantitative comparison with baseline methods on the MSVD dataset.

| Methods | B-4 | M | C | R-L |
|---|---|---|---|---|
| HATT [9] | 52.9 | 33.8 | 73.8 | - |
| ICA-LSTM [10] | 51.3 | 35.1 | 82.9 | - |
| SF-SSAG-LSTM [12] | 51.2 | 35.4 | 74.9 | - |
| LSTM-GAN [13] | 42.9 | 30.4 | - | - |
| DenseLSTM [17] | 50.4 | 32.9 | 72.6 | - |
| CAM-RNN [20] | 42.4 | 33.4 | 54.3 | 69.4 |
| VRE [23] | 51.7 | 34.3 | 86.7 | 71.9 |
| MR-HRNN [25] | 50.5 | 32.7 | 69.6 | - |
| STA-FG-RC [32] | 52.7 | 34.5 | - | - |
| Ours [Att-BiL-SL (V + I3D)] | 48.3 | 32.5 | 85.3 | 68.8 |
| Ours [Att-BiL-SL (IV3 + I3D)] | 49.1 | 33.2 | 86.4 | 70.3 |
| Ours [Att-BiL-SL (X + I3D)] | 51.2 | 35.7 | 86.9 | 72.1 |
Quantitative comparison with baseline methods on the MSR-VTT dataset.

| Methods | B-4 | M | C | R-L |
|---|---|---|---|---|
| HATT [9] | 41.2 | 28.5 | 44.7 | 60.7 |
| ICA-LSTM [10] | 41.2 | 27.7 | 43.9 | - |
| SF-SSAG-LSTM [12] | 40.8 | 28.7 | 46.8 | 61.5 |
| LSTM-GAN [13] | 36.0 | 26.1 | - | - |
| DenseLSTM [17] | 38.1 | 26.6 | 42.8 | - |
| CAM-RNN [20] | 37.7 | 27.9 | 38.8 | 58.8 |
| VRE [23] | 43.2 | 28.0 | 48.3 | 62.0 |
| MR-HRNN [25] | 36.2 | 25.6 | 33.5 | - |
| STA-FG-RC [32] | 40.8 | 27.4 | - | - |
| Ours [Att-BiL-SL (V + I3D)] | 39.0 | 27.8 | 46.2 | 57.8 |
| Ours [Att-BiL-SL (IV3 + I3D)] | 40.8 | 28.3 | 48.0 | 59.8 |
| Ours [Att-BiL-SL (X + I3D)] | 41.2 | 28.7 | 49.3 | 60.4 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).