Supervised Deep Learning Techniques for Image Description: A Systematic Review
Abstract
1. Introduction
2. Background
2.1. Image Description
- Alt-text generation for visually impaired people. Blind and low-vision individuals can understand webpage images or real-world scenes by automatically converting an image into text and describing the image using a text-to-speech system. This technique may allow visually impaired people to obtain as much information as possible about the contents of a photograph.
- Content-based image retrieval (CBIR). It consists of recovering a specific subset of pictures (or a single image) from relevant keywords reflecting the visual content found in the image. CBIR and feature extraction approaches are applied in various applications, such as medical image analysis, remote sensing, crime detection, video analysis, military surveillance, and the textile industry.
2.2. Convolutional Neural Networks
- Convolution layer: Feature extraction is carried out through filters called kernels, each generally followed by a ReLU layer.
- Pooling layer: A sliding window is applied to obtain summary statistics, thus reducing the size of the vector that represents the processed image.
- Flattening layer: Finally, a flattening layer is applied to change the matrix into a one-dimensional vector; the resulting vector is the one that feeds the fully connected neural network performing the detection or classification task (see the sketch after this list).
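As an illustration of these three layer types, the following minimal sketch assumes a TensorFlow/Keras environment; the input size, filter counts, and the 10-class head are illustrative placeholders rather than values taken from the reviewed works.

```python
# Minimal CNN sketch: convolution -> ReLU -> pooling -> flattening -> classifier.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # Convolution layer: kernels extract local features, each followed by ReLU.
    Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=(224, 224, 3)),
    # Pooling layer: keeps summary statistics, reducing the representation size.
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, kernel_size=(3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    # Flattening layer: turns the feature maps into a one-dimensional vector
    # that feeds the fully connected head for detection or classification.
    Flatten(),
    Dense(10, activation="softmax"),  # hypothetical 10-class classifier head
])
```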
2.3. Recurrent Neural Networks
2.4. Long Short-Term Memory
- Block input: updates the block input component $z_t$, which combines the current input $x_t$ and the output of the LSTM unit in the last iteration, $h_{t-1}$.
- Input gate: computes the input gate $i_t$, which combines the current input $x_t$, the output of the LSTM unit $h_{t-1}$, and the cell state $c_{t-1}$ from the last iteration.
- Forget gate: computes the forget gate $f_t$, with which the LSTM unit determines which information should be removed from its previous cell state $c_{t-1}$.
- Cell: computes the cell value $c_t$, which combines the block input $z_t$, the input gate $i_t$, and the forget gate $f_t$ values with the previous cell value $c_{t-1}$.
- Output gate: calculates the output gate $o_t$, which combines the current input $x_t$, the output of the LSTM unit in the last iteration $h_{t-1}$, and the current cell value $c_t$.
- Block output: combines the current cell value $c_t$ with the current output gate value $o_t$ to produce the output $h_t$. These components are written out in the equations after this list.
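Written out with the notation of the vanilla LSTM reviewed by Greff et al. [19] (input $x_t$, previous output $h_{t-1}$, previous cell state $c_{t-1}$, input weights $W$, recurrent weights $R$, peephole weights $p$, biases $b$, logistic sigmoid $\sigma$, and element-wise product $\odot$), these six components are:

$$
\begin{aligned}
z_t &= \tanh(W_z x_t + R_z h_{t-1} + b_z), \\
i_t &= \sigma(W_i x_t + R_i h_{t-1} + p_i \odot c_{t-1} + b_i), \\
f_t &= \sigma(W_f x_t + R_f h_{t-1} + p_f \odot c_{t-1} + b_f), \\
c_t &= z_t \odot i_t + c_{t-1} \odot f_t, \\
o_t &= \sigma(W_o x_t + R_o h_{t-1} + p_o \odot c_t + b_o), \\
h_t &= \tanh(c_t) \odot o_t.
\end{aligned}
$$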
2.5. Encoder-Decoder Approach
- Select a dataset. It is necessary to use a dataset that includes a large collection of images (on the order of a thousand images or more), each with several captions providing a precise description of its content.
- Encoder (feature extraction model). CNNs are the de facto tool for extracting the input image features. A CNN performs dimensionality reduction, representing the pixels of the image in such a way that the interesting parts of the picture are effectively captured in the extracted feature signals. Currently, we can address this task in one of the following ways:
- Training the CNN directly on the images in the image captioning dataset;
- Using a pre-trained image classification model, such as the VGG model, ResNet50, Inception V3, or EfficientNetB7.
- The extracted feature signals are represented in a fixed-length encoding vector. This fixed-length vector includes a rich representation of the input image.
- Decoder (language model). RNNs are the de facto tool for sequence prediction problems. At this stage, the RNN takes the fixed-length encoding vector and predicts the probability of the next word in a sequence, given the words already present in that sequence. The output is the natural language description of the image. Currently, LSTM networks are a commonly used RNN architecture, as they can capture longer sequences of words or sentences than conventional RNNs. A minimal sketch of this encoder-decoder pipeline is given after this list.
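As a concrete illustration of the encoder-decoder pipeline described above, the following minimal sketch assumes a TensorFlow/Keras environment and a merge-style decoder; the 4096-dimensional feature size (e.g., a VGG fully connected layer), the vocabulary size, and the maximum caption length are illustrative placeholders rather than values taken from any reviewed paper.

```python
# Minimal encoder-decoder captioning sketch: pre-extracted CNN features + LSTM decoder.
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM, add

vocab_size = 5000   # hypothetical vocabulary size
max_len = 34        # hypothetical maximum caption length
feat_dim = 4096     # e.g., the fully connected feature vector of a pre-trained VGG encoder

# Encoder branch: project the fixed-length CNN feature vector.
image_features = Input(shape=(feat_dim,))
fe = Dropout(0.5)(image_features)
fe = Dense(256, activation="relu")(fe)

# Decoder branch: embed the partial caption and run it through an LSTM.
caption_input = Input(shape=(max_len,))
se = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
se = Dropout(0.5)(se)
se = LSTM(256)(se)

# Merge both representations and predict the next word of the caption.
decoder = add([fe, se])
decoder = Dense(256, activation="relu")(decoder)
outputs = Dense(vocab_size, activation="softmax")(decoder)

model = Model(inputs=[image_features, caption_input], outputs=outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

At inference time, the model is applied repeatedly: the word predicted at each step is appended to the partial caption, which is fed back in until an end-of-sequence token or the maximum length is reached.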
3. Methodology
- (1) Planning the research. The first phase consists of formulating the appropriate research questions to identify the topic for research. This work examines the following research questions (RQ):
- RQ1: What is the custom architecture implemented in the encoder-decoder approach used in the research papers?
- RQ2: Which datasets are used to train and test the models?
- RQ3: Which metrics were used to evaluate the obtained results?
These three questions form the basis for developing a research strategy for the literature extraction. Following the research question definition, the next activity in the planning phase involves the selection of sources to define the search strategy. For this study, the following international online bibliographic databases were selected:
- IEEE Xplore.
- ACM Digital Library.
- ScienceDirect.
- Springer.
The searches were limited to peer-reviewed publications written in English and released between 2014 and 2022. The keywords used for the searches were “Encoder-decoder for automatic image description”, “Encoder-decoder for automatic image captioning”, “deep learning for image description”, and “Evaluation of image description generator models”. The last step in the planning phase was selecting and evaluating the research papers. During this phase, an initial selection was made based on the titles, keywords, and abstracts of the possible primary studies. This first phase allows us to identify the current view of the research problem, define the latest models and approaches used to solve it, and confirm that it is of current interest.
- (2) Conducting the search. The second phase of the methodology consists of extracting and analyzing documents from the online databases. The following requirements were applied to ensure that the findings were appropriately classified:
- The central topic of the primary study is the design and implementation of a deep learning model for image description.
- The primary studies report all of the essential components on which an encoder-decoder approach for image description is built.
- The primary studies report all of the metrics used to evaluate the image description model.
- The research articles mention the datasets employed.
To obtain a concise list of articles, a comparison check was conducted to detect duplicate papers. In addition, an analysis of the introduction and conclusion sections was required to decide which papers to select or discard. After analyzing and evaluating 91 research articles, 53 were chosen based on their relevance to the topic of study.
- (3) Presentation of the review report. The final phase of the systematic review framework was to derive analytical results by answering the research questions. These results are presented in the next section.
4. Review and Discussion
4.1. Main Architectures
- CNN + RNN. In this architecture, a CNN is employed to extract the characteristics of the image, while an RNN is employed to generate the description. A total of 10 works out of 53 (19%) follow this method. It is worth noting that this architecture was used in the first works on automatic image description using the encoder-decoder approach.
- CNN + LSTM. This architecture uses a CNN encoder and an RNN decoder with LSTM modules to mitigate the vanishing gradient problem. Most of the works, including the recent ones (41 out of 53, representing 77%), follow this method.
- CNN + CNN. This architecture uses two CNNs; the first is for extracting characteristics from the image, and the second is for generating the image description from the first CNN results. Only two works (4%) follow this method.
4.2. CNN + RNN Architecture
4.3. CNN + LSTM Architecture
4.3.1. Attention-Based Image Description
4.3.2. Semantic-Based Image Description
4.3.3. Reinforcement Learning-Based Image Description
4.4. CNN + CNN Architecture
4.5. Datasets
- MS COCO [68]. The Microsoft Common Objects in Context (COCO) caption dataset was developed by the Microsoft team and is aimed at scene understanding. It contains images of complex daily scenes and supports multiple tasks, such as image recognition, segmentation, and description. The dataset contains 165,482 images and a text file with nearly one million descriptions. This is the most used dataset, with 77% (41 out of 53) of the reviewed works using it.
- Flickr8k/Flickr30k [69,70]. The images in the Flickr8k dataset come from Yahoo’s photo-sharing website, Flickr; the dataset contains 8000 photos. Flickr30k is the extended version of the previous dataset and contains 31,783 images collected from the Flickr website. The images usually capture real-world scenes, and each comes with five descriptions. Among the reviewed works, the Flickr8k dataset occupies second place, as it is used in 13 of the 53 works, which is equivalent to 25%. Flickr30k takes third place, as it is used in 11 works, which is equivalent to 21%.
- The IAPR–TC12 Dataset [72]. This dataset has 20,000 images collected from various sources, covering topics such as sports, people, animals, and landscapes. The images in this dataset have captions in multiple languages. Three works use this dataset [10,11,25], corresponding to 6% of the total.
- The SBU Captions Dataset [74]. SBU is an older dataset that contains images with short text descriptions and is used to induce word embeddings learned from both images and text. It contains one million images with associated visually relevant captions. Only one work uses this dataset [10], corresponding to 2% of the total.
- The PASCAL Dataset [75]. This dataset provides a standardized image dataset for object class recognition and a common set of tools for accessing the dataset’s annotations, and it enables the evaluation and comparison of different methods. Since the dataset is an annotation of PASCAL VOC 2010, it has the same statistics as the original dataset. The training and validation sets contain 10,103 images, while the testing set contains 9637 images. Three works [22,35,49] (6%) use this dataset.
- UIUC [76]. This dataset contains eight sports event categories: rowing (250 images), badminton (200 images), polo (182 images), bocce (137 images), snowboarding (190 images), croquet (236 images), sailing (190 images), and rock climbing (194 images). The images are divided into easy and medium according to human subject judgment. Information on the distance of the foreground objects is also provided for each image. Only one work uses this dataset [50] (2% of the total).
4.6. Evaluation Metrics
- BLEU (Bilingual evaluation understudy) [78]. This is the most widely used metric in practice. It was originally designed not for the image description problem but for machine translation. Based on a precision measure, it analyzes the co-occurrence of n-grams (contiguous sequences of words) between the system-generated sentence and the reference sentences, counting how many of them match. A total of 100% of the papers presented in this review used this metric. The BLEU metric first computes the geometric average of the modified n-gram precisions $p_n$, using n-grams up to a length $N$ and positive weights $w_n$ summing to one. Next, let $c$ be the length of the candidate translation and $r$ the effective length of the reference corpus, and calculate the brevity penalty $\mathrm{BP}$. The BLEU score ranges from 0 to 1, and few candidates attain a score of 1 unless they are identical to a reference translation:
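Using this notation, the brevity penalty and the final score defined in [78] are:

$$
\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases},
\qquad
\mathrm{BLEU} = \mathrm{BP}\cdot\exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right).
$$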
- ROUGE (Recall-oriented understudy for gisting evaluation) [79]. This is a set of metrics commonly used to evaluate automatic summarization and translation tasks. Again, it is based on comparing the n-grams of a hypothesis against one or several references. This metric is used in 33 out of 53 works, which is equivalent to 62%. ROUGE-N is an n-gram recall between a candidate summary and a set of reference summaries, while ROUGE-L is based on the longest common subsequence (LCS). Formally, ROUGE-L is computed as follows:
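As defined in [79], for a reference summary $X$ of length $m$ and a candidate summary $Y$ of length $n$:

$$
R_{lcs} = \frac{\mathrm{LCS}(X,Y)}{m},\qquad
P_{lcs} = \frac{\mathrm{LCS}(X,Y)}{n},\qquad
\mathrm{ROUGE\text{-}L} = F_{lcs} = \frac{(1+\beta^2)\,R_{lcs}\,P_{lcs}}{R_{lcs} + \beta^2\,P_{lcs}},
$$

where $\mathrm{LCS}(X,Y)$ is the length of the longest common subsequence of $X$ and $Y$, and $\beta$ controls the relative importance of recall over precision.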
- METEOR (Metric for evaluation of translation with explicit ordering) [80]. This is also a metric originally used to evaluate machine translation output. As with BLEU, the basic unit of evaluation is the sentence. The algorithm first creates an alignment between the translation generated by the machine translation model and the reference translation. This metric is used in 47 out of 53 works, which is equivalent to 89%. The METEOR score for this pairing is computed as follows: based on the number of mapped unigrams found between the two strings ($m$), the total number of unigrams in the translation ($t$), and the total number of unigrams in the reference ($r$), calculate the unigram precision $P = m/t$ and the unigram recall $R = m/r$. Then, calculate a parameterized harmonic mean of $P$ and $R$. METEOR also computes a penalty for a given alignment, as follows. First, the sequence of matched unigrams between the two strings is divided into the fewest possible number of “chunks” such that the matched unigrams in each chunk are adjacent (in both strings) and in identical word order. The number of chunks ($ch$) and the number of matches ($m$) are then used to calculate a fragmentation fraction $\mathrm{frag} = ch/m$, from which the penalty is computed; the value of $\gamma$ determines the maximum penalty ($0 \le \gamma \le 1$), and the value of $\beta$ determines the functional relation between fragmentation and the penalty. Finally, the METEOR score for the alignment between the two strings is calculated as follows:
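With these quantities ($m$, $t$, $r$, $ch$) and the parameters $\alpha$, $\beta$, and $\gamma$ of [80]:

$$
P = \frac{m}{t},\qquad R = \frac{m}{r},\qquad
F_{mean} = \frac{P\,R}{\alpha\,P + (1-\alpha)\,R},
$$
$$
\mathrm{Pen} = \gamma\left(\frac{ch}{m}\right)^{\beta},\qquad
\mathrm{METEOR} = (1 - \mathrm{Pen})\,F_{mean}.
$$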
- CIDEr (Consensus-based image description evaluation) [81]. This is a metric specifically designed to evaluate image descriptions. All of the words in the descriptions (both candidates and references) are mapped to their lemma or root form so that n-gram matching is not limited to exact surface matches. This metric is used in 15 out of 53 works, which is equivalent to 25%. The CIDEr formula is as follows:
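Following [81], with $g^n(\cdot)$ the vector of TF-IDF weights of all n-grams of length $n$, a candidate caption $c_i$, and a set of reference captions $S_i = \{s_{i1},\dots,s_{im}\}$:

$$
\mathrm{CIDEr}_n(c_i, S_i) = \frac{1}{m}\sum_{j=1}^{m}\frac{g^n(c_i)\cdot g^n(s_{ij})}{\lVert g^n(c_i)\rVert\,\lVert g^n(s_{ij})\rVert},\qquad
\mathrm{CIDEr}(c_i, S_i) = \sum_{n=1}^{N} w_n\,\mathrm{CIDEr}_n(c_i, S_i),
$$

with uniform weights $w_n = 1/N$ and, typically, $N = 4$.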
- SPICE (Semantic propositional image caption evaluation) [82]. This is used to measure the efficiency of the models by comparing the captions of the photographs against the objects they contain. This metric is used in 24 out of 53 works, which is equivalent to 45%. The evaluation is as follows: given a candidate caption $c$ and a set of reference captions $S$ associated with an image, the goal is to compute a score that captures the similarity between $c$ and $S$. SPICE defines the subtask of parsing captions to scene graphs as follows: given a set of object classes $C$, a set of relation types $R$, a set of attribute types $A$, and a caption $c$, it parses $c$ to a scene graph $G(c)$. To evaluate the similarity of candidate and reference scene graphs, it defines a function $T$ that returns the logical tuples of a scene graph, and a binary matching operator $\otimes$ that returns the tuples matching in two scene graphs. It then defines the precision $P$, the recall $R$, and SPICE as an F-score; being an F-score, SPICE is simple to understand and easily interpretable, as it is bounded between 0 and 1.
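In the notation of [82], where $O(c)$, $E(c)$, and $K(c)$ are the sets of objects, relations, and attributes of the scene graph $G(c)$, and $G(S)$ is the union of the reference scene graphs, these quantities are:

$$
G(c) = \langle O(c), E(c), K(c) \rangle, \qquad T(G(c)) = O(c) \cup E(c) \cup K(c),
$$
$$
P(c,S) = \frac{\left| T(G(c)) \otimes T(G(S)) \right|}{\left| T(G(c)) \right|},\qquad
R(c,S) = \frac{\left| T(G(c)) \otimes T(G(S)) \right|}{\left| T(G(S)) \right|},
$$
$$
\mathrm{SPICE}(c,S) = F_1(c,S) = \frac{2\, P(c,S)\, R(c,S)}{P(c,S) + R(c,S)}.
$$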
- SPIDEr [58]. This metric is the result of combining the properties of CIDEr and SPICE. It was proposed to overcome some of the problems attributed to the existing metrics. This metric is used in 34 out of 53 works, which is equivalent to 64%. SPIDEr uses a policy gradient (PG) method to directly optimize a linear combination of SPICE and CIDEr:
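Assuming the equal-weight combination commonly used in practice, the reward optimized by the PG method for a candidate caption $c$ is:

$$
\mathrm{SPIDEr}(c) = \tfrac{1}{2}\,\mathrm{CIDEr}(c) + \tfrac{1}{2}\,\mathrm{SPICE}(c).
$$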
- mrank (Median rank) [83]. This metric measures the rank of the correct description for each image, using the median rank among all sentences. This metric is used in 5 out of 53 works, which is equivalent to 9%. The mrank metric is used by some authors to evaluate the object detection phase performed before the description is generated. First, the intersection over union (IoU) is calculated; the IoU measures the accuracy of an object detector and is defined as the area of the intersection divided by the area of the union of a predicted bounding box $B_p$ and a ground-truth box $B_{gt}$. Then, the recall at a given IoU threshold $o$ is calculated, and the average recall (AR) is obtained as the recall averaged over all IoU thresholds $o \in [0.5, 1]$. Finally, the mean of the average recall (mAR) is calculated across all $K$ classes:
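A common formulation of these quantities (the exact variant may differ across the cited works) is the following, where $\mathrm{recall}(o)$ denotes the detection recall at IoU threshold $o$ and the factor 2 normalizes the length of the threshold interval:

$$
\mathrm{IoU}(B_p, B_{gt}) = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})},\qquad
\mathrm{AR} = 2\int_{0.5}^{1}\mathrm{recall}(o)\,do,\qquad
\mathrm{mAR} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{AR}_k.
$$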
- PPLX (Perplexity) [10]. This metric was proposed by Kiros et al. to evaluate the effectiveness of using pre-trained word embeddings. Only one work employs this metric [10]. PPLX is not only used as a measure of performance but also as a link between the text and the additional modality (the image):
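In the multimodal language-model setting of [10], perplexity takes its standard form, here conditioned on the image $x$ (this is the generic definition; the exact conditioning details follow the cited work):

$$
\mathrm{PPLX}(w_{1:N}\mid x) = 2^{-\frac{1}{N}\sum_{n=1}^{N}\log_2 P\left(w_n \mid w_{1:n-1},\, x\right)},
$$

where $w_{1:N}$ is the caption and $N$ its length.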
5. Conclusions and Future Directions
- The prevalent architecture for automatic image description employs a CNN as an encoder and an LSTM network as a decoder.
- The most used dataset for training and evaluating models is the MS COCO dataset, used by most of the reviewed papers.
- All of the papers in the review use more than one metric to compare the performance of the proposed models, highlighting BLEU and METEOR as the most used metrics.
- Multilingual models: The models and advances in the automatic generation of image descriptions have focused solely on the English language. Studying different languages or multilingual datasets would be interesting.
- Amount of data for training: Most of the current models use the supervised learning approach, so they need a large amount of labeled data. For this reason, semi-supervised, unsupervised, and reinforcement learning will be more prevalent in creating future models for generating automatic image descriptions.
- Variety of datasets: The accuracy of the descriptions generated by the existing models depends on the dataset used, and there are few available. It would be interesting to have more and increasingly diverse datasets for future research in this field.
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
CNN | Convolutional Neural Network |
RNN | Recurrent Neural Network |
GAN | Generative Adversarial Network |
MS COCO | Microsoft Common Objects in Context |
LSTM | Long Short-Term Memory |
BLEU | Bilingual Evaluation Understudy |
RQ | Research Questions |
IEEE | Institute of Electrical and Electronics Engineers |
ACM | Association for Computing Machinery |
ROUGE | Recall-Oriented Understudy for Gisting Evaluation |
METEOR | Metric for Evaluation of Translation with Explicit Ordering |
CIDEr | Consensus-based Image Description Evaluation |
SPICE | Semantic Propositional Image Caption Evaluation |
PPLX | Perplexity |
IoU | Intersection Over Union |
mRank | Median Rank |
CBIR | Content-Based Image Retrieval |
References
- Wang, H.; Qin, Z.; Wan, T. Text Generation Based on Generative Adversarial Nets with Latent Variables. In Proceedings of the Advances in Knowledge Discovery and Data Mining, Melbourne, VIC, Australia, 3–6 June 2018; Phung, D., Tseng, V.S., Webb, G.I., Ho, B., Ganji, M., Rashidi, L., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 92–103. [Google Scholar]
- Dai, B.; Fidler, S.; Urtasun, R.; Lin, D. Towards Diverse and Natural Image Descriptions via a Conditional GAN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Shetty, R.; Rohrbach, M.; Anne Hendricks, L.; Fritz, M.; Schiele, B. Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Mohamad Nezami, O.; Dras, M.; Wan, S.; Paris, C.; Hamey, L. Towards Generating Stylized Image Captions via Adversarial Training. In Proceedings of the PRICAI 2019: Trends in Artificial Intelligence, Cuvu, Yanuca Island, Fiji, 26–30 August 2019; Nayak, A.C., Sharma, A., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 270–284. [Google Scholar]
- Jiang, W.; Li, X.; Hu, H.; Lu, Q.; Liu, B. Multi-Gate Attention Network for Image Captioning. IEEE Access 2021, 9, 69700–69709. [Google Scholar] [CrossRef]
- Association, T.A.A. Guidelines for Creating Image; The American Anthropological Association: Arlington, VA, USA, 2019. [Google Scholar]
- Amirian, S.; Rasheed, K.; Taha, T.R.; Arabnia, H.R. Automatic Image and Video Caption Generation with Deep Learning: A Concise Review and Algorithmic Overlap. IEEE Access 2020, 8, 218386–218400. [Google Scholar] [CrossRef]
- Zhang, L.; Sung, F.; Liu, F.; Xiang, T.; Gong, S.; Yang, Y.; Hospedales, T.M. Actor-critic sequence training for image captioning. arXiv 2017, arXiv:1706.09601. [Google Scholar]
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; Bach, F., Blei, D., Eds.; PMLR: Lille, France, 2015; Volume 37, pp. 2048–2057. [Google Scholar]
- Kiros, R.; Salakhutdinov, R.; Zemel, R. Multimodal Neural Language Models. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; Xing, E.P., Jebara, T., Eds.; PMLR: Beijing, China, 2014; Volume 32, pp. 595–603. [Google Scholar]
- Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Yuille, A.L. Explain images with multimodal recurrent neural networks. arXiv 2014, arXiv:1410.1090. [Google Scholar]
- Wang, Q.; Chan, A.B. Cnn+ cnn: Convolutional decoders for image captioning. arXiv 2018, arXiv:1805.09019. [Google Scholar]
- Chen, X.; Lawrence Zitnick, C. Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- LeCun, Y.; Boser, B.; Denker, J.; Henderson, D.; Howard, R.; Hubbard, W.; Jackel, L. Handwritten Digit Recognition with a Back-Propagation Network. In Proceedings of the Advances in Neural Information Processing Systems; Touretzky, D., Ed.; Morgan-Kaufmann: Burlington, MA, USA, 1989; Volume 2. [Google Scholar]
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
- Sarkar, D.; Bali, R.; Sharma, T. Practical Machine Learning with Python; Apress: New York, NY, USA, 2018. [Google Scholar] [CrossRef]
- Pascanu, R.; Gulcehre, C.; Cho, K.; Bengio, Y. How to construct deep recurrent neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and Tell: A Neural Image Caption Generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A Search Space Odyssey. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 2222–2232. [Google Scholar] [CrossRef] [Green Version]
- Houdt, G.V.; Mosquera, C.; Nápoles, G. A review on the long short-term memory model. Artif. Intell. Rev. 2020, 53, 5929–5955. [Google Scholar] [CrossRef]
- Kiros, R.; Salakhutdinov, R.; Zemel, R.S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv 2014, arXiv:1411.2539. [Google Scholar]
- Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R.K.; Deng, L.; Dollar, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J.C.; et al. From Captions to Visual Concepts and Back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Karpathy, A.; Fei-Fei, L. Deep Visual-Semantic Alignments for Generating Image Descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Kitchenham, B. Procedures for Performing Systematic Reviews; Technical Report; Keele University: Keele, UK, 2004. [Google Scholar]
- Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Yuille, A.L. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Jia, X.; Gavves, E.; Fernando, B.; Tuytelaars, T. Guiding the Long-Short Term Memory Model for Image Caption Generation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Fu, K.; Jin, J.; Cui, R.; Sha, F.; Zhang, C. Aligning Where to See and What to Tell: Image Captioning with Region-Based Attention and Scene-Specific Contexts. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2321–2334. [Google Scholar] [CrossRef]
- Johnson, J.; Karpathy, A.; Fei-Fei, L. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Mao, J.; Huang, J.; Toshev, A.; Camburu, O.; Yuille, A.L.; Murphy, K. Generation and Comprehension of Unambiguous Object Descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Yang, L.; Tang, K.; Yang, J.; Li, L.J. Dense Captioning with Joint Inference and Visual Context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Hendricks, L.A.; Venugopalan, S.; Rohrbach, M.; Mooney, R.; Saenko, K.; Darrell, T. Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Gan, Z.; Gan, C.; He, X.; Pu, Y.; Tran, K.; Gao, J.; Carin, L.; Deng, L. Semantic Compositional Networks for Visual Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Tavakoli, H.R.; Shetty, R.; Borji, A.; Laaksonen, J. Paying Attention to Descriptions Generated by Image Captioning Models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Gu, J.; Wang, G.; Cai, J.; Chen, T. An Empirical Study of Language CNN for Image Captioning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-Critical Sequence Training for Image Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Venugopalan, S.; Anne Hendricks, L.; Rohrbach, M.; Mooney, R.; Darrell, T.; Saenko, K. Captioning Images with Diverse Objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Huang, L.; Wang, W.; Chen, J.; Wei, X.Y. Attention on Attention for Image Captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-Memory Transformer for Image Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J.; Gao, J. Unified Vision-Language Pre-Training for Image Captioning and VQA. Proc. AAAI Conf. Artif. Intell. 2020, 34, 13041–13049. [Google Scholar] [CrossRef]
- Pan, Y.; Yao, T.; Li, Y.; Mei, T. X-Linear Attention Networks for Image Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Klein, F.; Mahajan, S.; Roth, S. Diverse Image Captioning with Grounded Style. In Proceedings of the Pattern Recognition: 43rd DAGM German Conference, DAGM GCPR 2021, Bonn, Germany, 28 September–1 October 2021; Springer: Berlin/Heidelberg, Germany, 2022; pp. 421–436. [Google Scholar]
- Karpathy, A.; Joulin, A.; Fei-Fei, L.F. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. In Proceedings of the Advances in Neural Information Processing Systems; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K., Eds.; Curran Associates, Inc.: New York, NY, USA, 2014; Volume 27. [Google Scholar]
- Yang, Z.; Yuan, Y.; Wu, Y.; Cohen, W.W.; Salakhutdinov, R.R. Review Networks for Caption Generation. In Proceedings of the Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2016; Volume 29. [Google Scholar]
- Sugano, Y.; Bulling, A. Seeing with humans: Gaze-assisted neural image captioning. arXiv 2016, arXiv:1608.05203. [Google Scholar]
- Mathews, A.; Xie, L.; He, X. SentiCap: Generating Image Descriptions with Sentiments. Proc. AAAI Conf. Artif. Intell. 2016, 30. [Google Scholar] [CrossRef]
- Wang, M.; Song, L.; Yang, X.; Luo, C. A parallel-fusion RNN-LSTM architecture for image caption generation. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 4448–4452. [Google Scholar] [CrossRef]
- Tran, K.; He, X.; Zhang, L.; Sun, J.; Carapcea, C.; Thrasher, C.; Buehler, C.; Sienkiewicz, C. Rich Image Captioning in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Ma, S.; Han, Y. Describing images by feeding LSTM with structural words. In Proceedings of the 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11–15 July 2016; pp. 1–6. [Google Scholar] [CrossRef]
- You, Q.; Jin, H.; Wang, Z.; Fang, C.; Luo, J. Image Captioning with Semantic Attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Yao, T.; Pan, Y.; Li, Y.; Qiu, Z.; Mei, T. Boosting Image Captioning with Attributes. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4904–4912. [Google Scholar]
- Pedersoli, M.; Lucas, T.; Schmid, C.; Verbeek, J. Areas of Attention for Image Captioning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Ren, Z.; Wang, X.; Zhang, N.; Lv, X.; Li, L.J. Deep Reinforcement Learning-Based Image Captioning with Embedding Reward. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Wang, Y.; Lin, Z.; Shen, X.; Cohen, S.; Cottrell, G.W. Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Liu, C.; Mao, J.; Sha, F.; Yuille, A. Attention Correctness in Neural Image Captioning. Proc. AAAI Conf. Artif. Intell. 2017, 31. [Google Scholar] [CrossRef]
- Gan, C.; Gan, Z.; He, X.; Gao, J.; Deng, L. StyleNet: Generating Attractive Visual Captions with Styles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Liu, S.; Zhu, Z.; Ye, N.; Guadarrama, S.; Murphy, K. Improved Image Captioning via Policy Gradient Optimization of SPIDEr. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Yao, T.; Pan, Y.; Li, Y.; Mei, T. Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Wu, Q.; Shen, C.; Wang, P.; Dick, A.; Hengel, A.v.d. Image Captioning and Visual Question Answering Based on Attributes and External Knowledge. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1367–1381. [Google Scholar] [CrossRef] [Green Version]
- Aneja, J.; Deshpande, A.; Schwing, A.G. Convolutional Image Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Ding, S.; Qu, S.; Xi, Y.; Wan, S. Stimulus-driven and concept-driven analysis for image caption generation. Neurocomputing 2020, 398, 520–530. [Google Scholar] [CrossRef]
- Yang, L.; Wang, H.; Tang, P.; Li, Q. CaptionNet: A Tailor-made Recurrent Neural Network for Generating Image Descriptions. IEEE Trans. Multimed. 2021, 23, 835–845. [Google Scholar] [CrossRef]
- Zhong, W.; Miyao, Y. Leveraging Partial Dependency Trees to Control Image Captions. In Proceedings of the Second Workshop on Advances in Language and Vision Research, Online; Association for Computational Linguistics: Cedarville, OH, USA, 2021; pp. 16–21. [Google Scholar] [CrossRef]
- Tian, P.; Mo, H.; Jiang, L. Image Caption Generation Using Multi-Level Semantic Context Information. Symmetry 2021, 13, 1184. [Google Scholar]
- Deng, Y.; Li, Y.; Zhang, Y.; Xiang, X.; Wang, J.; Chen, J.; Ma, J. Hierarchical Memory Learning for Fine-Grained Scene Graph Generation. In Proceedings of the Computer Vision–ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2022; pp. 266–283. [Google Scholar]
- Fei, Z. Efficient Modeling of Future Context for Image Captioning. In Proceedings of the 30th ACM International Conference on Multimedia, ACM, Lisboa, Portugal, 10–14 October 2022. [Google Scholar] [CrossRef]
- Chen, X.; Fang, H.; Lin, T.Y.; Vedantam, R.; Gupta, S.; Dollár, P.; Zitnick, C.L. Microsoft coco captions: Data collection and evaluation server. arXiv 2015, arXiv:1504.00325. [Google Scholar]
- Hodosh, M.; Young, P.; Hockenmaier, J. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. J. Artif. Intell. Res. 2013, 47, 853–899. [Google Scholar] [CrossRef] [Green Version]
- Plummer, B.A.; Wang, L.; Cervantes, C.M.; Caicedo, J.C.; Hockenmaier, J.; Lazebnik, S. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.J.; Shamma, D.A.; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. 2017, 123, 32–73. [Google Scholar] [CrossRef] [Green Version]
- Grubinger, M.; Clough, P.; Müller, H.; Deselaers, T. The IAPR TC12 Benchmark: A New Evaluation Resource for Visual Information Systems. Workshop Ontoimage 2006, 2. Available online: https://www.cs.brandeis.edu/~marc/misc/proceedings/lrec-2006/workshops/W02/RealFinalOntoImage2006-2.pdf#page=13 (accessed on 5 March 2023).
- Bychkovsky, V.; Paris, S.; Chan, E.; Durand, F. Learning photographic global tonal adjustment with a database of input/output image pairs. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 97–104. [Google Scholar] [CrossRef]
- Ordonez, V.; Kulkarni, G.; Berg, T. Im2Text: Describing Images Using 1 Million Captioned Photographs. In Proceedings of the Advances in Neural Information Processing Systems; Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K., Eds.; Curran Associates, Inc.: New York, NY, USA, 2011; Volume 24. [Google Scholar]
- Everingham, M.; Eslami, S.M.A.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2014, 111, 98–136. [Google Scholar] [CrossRef]
- Li, L.J.; Fei-Fei, L. What, where and who? Classifying events by scene and object recognition. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio De Janeiro, Brazil, 14–21 October 2007; pp. 1–8. [Google Scholar] [CrossRef]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef] [Green Version]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; Association for Computational Linguistics: Cedarville, OH, USA, 2002; pp. 311–318. [Google Scholar] [CrossRef] [Green Version]
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL 2004, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
- Lavie, A.; Agarwal, A. Meteor: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. In Proceedings of the Second Workshop on Statistical Machine Translation; Association for Computational Linguistics: Cedarville, OH, USA, 2007; pp. 228–231. [Google Scholar]
- Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. CIDEr: Consensus-Based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic Propositional Image Caption Evaluation. In Proceedings of the Computer Vision–ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 382–398. [Google Scholar]
- Socher, R.; Karpathy, A.; Le, V.; Manning, C.; Ng, A. Grounded Compositional Semantics for Finding and Describing Images with Sentences. Trans. Assoc. Comput. Linguist. 2014, 2, 207–218. [Google Scholar] [CrossRef]
Year, Author | Architecture | Datasets | Evaluation Metrics |
---|---|---|---|
2014, Karpathy et al. [44] | CNN + RNN | Flickr 8k/Flickr 30K | mRank |
2014, Mao et al. [11] | CNN + RNN | Flickr 8k/Flickr 30K, IAPR TC-12 | BLEU, mRank |
2014, Kiros et al. [10] | CNN + RNN | IAPR TC-12, SBU | BLEU, PPLX |
2014, Kiros et al. [21] | CNN + RNN | Flickr 8K, Flickr 30K | mRank |
2015, Chen et al. [13] | CNN + RNN | PASCAL, Flickr 8K/30K, MS COCO | BLEU, METEOR, CIDEr |
2015, Mao et al. [25] | CNN + RNN | IAPR TC-12, Flickr 8K/ Flickr 30K | BLEU, mRank |
2015, Fang et al. [22] | CNN + RNN | PASCAL, MS COCO | BLEU, METEOR |
2015, Karpathy et al. [23] | CNN + RNN | Flickr 8K/30K, MS COCO | BLEU, METEOR, CIDEr |
2015, Vinyals et al. [18] | CNN + LSTM | Flickr 8K/30K, MS COCO | BLEU, METEOR, CIDEr |
2015, Jia et al. [26] | CNN + LSTM | Flickr 8K/30K, MS COCO | BLEU, METEOR, CIDEr |
2015, Xu et al. [9] | CNN + LSTM | Flickr 8K/30K, MS COCO | BLEU, METEOR |
2015, Jin et al. [27] | CNN + LSTM | Flickr 8K/30K, MS COCO | BLEU, METEOR, ROUGE, CIDEr |
2016, Yang et al. [45] | CNN + RNN | MS COCO | BLEU, METEOR, CIDEr |
2016, Sugano et at. [46] | CNN + LSTM | MS COCO | BLEU, METEOR, ROUGE, CIDEr |
2016, Mathews et al. [47] | CNN + LSTM | MS COCO | BLEU, METEOR, ROUGE, CIDEr |
2016, Wang et al. [48] | CNN + LSTM | Flickr 8K/30K, MS COCO | BLEU, mRank |
2016, Johnson et al. [28] | CNN + LSTM | Visual Genome | METEOR |
2016, Mao et al. [29] | CNN + LSTM | MS COCO | BLEU, METEOR, CIDEr |
2016, Tran et al. [49] | CNN + LSTM | MS COCO, MIT-Adobe FiveK | Human Evaluation |
2016, Ma et al. [50] | CNN + LSTM | Flickr 8k, UIUC | BLEU, mRank |
2016, You et al. [51] | CNN + LSTM | Flickr 30K, MS COCO | BLEU, METEOR, ROUGE, CIDEr |
2016, Yang et al. [30] | CNN + LSTM | Visual Genome | METEOR |
2016, Anne et al. [31] | CNN + LSTM | MS COCO, ImageNet | BLEU, METEOR |
2017, Yao et al. [52] | CNN + LSTM | MS COCO | BLEU, METEOR, ROUGE, CIDEr |
2017, Lu et al. [32] | CNN + LSTM | Flickr 30K, MS COCO | BLEU, METEOR, CIDEr |
2017, Chen et al. [33] | CNN + LSTM | Flickr 8K/30K, MS COCO | BLEU, METEOR, ROUGE, CIDEr |
2017, Gan et al. [34] | CNN + LSTM | Flickr 30K, MS COCO | BLEU, METEOR, CIDEr |
2017, Pedersoli et al. [53] | CNN + LSTM | MS COCO | BLEU, METEOR, CIDEr |
2017, Ren et al. [54] | CNN + LSTM | MS COCO | BLEU, METEOR, ROUGE, CIDEr |
2017, Wang et al. [55] | CNN + LSTM | MS COCO, Stock3M | SPICE, METEOR, ROUGE, CIDEr |
2017, Tavakoli et al. [35] | CNN + LSTM | MS COCO, PASCAL | BLEU, METEOR, ROUGE, CIDEr |
2017, Liu et al. [56] | CNN + LSTM | Flickr 30K, MS COCO | BLEU, METEOR |
2017, Gan et al. [57] | CNN + LSTM | Flickr 30K | BLEU, METEOR, ROUGE, CIDEr |
2017, Liu et al. [58] | CNN + LSTM | MS COCO | SPIDEr, Human Evaluation |
2017, Gu et al. [36] | CNN + LSTM | Flickr 30K, MS COCO | BLEU, METEOR, CIDEr, SPICE |
2017, Yao et al. [59] | CNN + LSTM | MS COCO, ImageNet | METEOR |
2017, Rennie et al. [37] | CNN + LSTM | MS COCO | BLEU, METEOR, CIDEr, ROUGE |
2017, Venugopalan et al. [38] | CNN + LSTM | MS COCO, ImageNet | METEOR |
2017, Zhang et al. [8] | CNN + LSTM | MS COCO | BLEU, METEOR, ROUGE, CIDEr |
2018, Wu et al. [60] | CNN + LSTM | Flickr 8K/30K, MS COCO | BLEU, METEOR, CIDEr |
2018, Aneja et al. [61] | CNN + CNN | MS COCO | BLEU, METEOR, ROUGE, CIDEr |
2018, Wang et al. [12] | CNN + CNN | MS COCO | BLEU, METEOR, ROUGE, CIDEr |
2019, Huang et al. [39] | CNN + LSTM | MS COCO | BLEU, METEOR, ROUGE, CIDEr, SPICE |
2020, Cornia et al. [40] | CNN + LSTM | MS COCO | BLEU, METEOR, ROUGE, CIDEr, SPICE |
2020, Zhou et al. [41] | CNN + RNN | MS COCO, Flick 30K | BLEU, METEOR, CIDEr, SPICE |
2020, Ding et al. [62] | CNN + LSTM | MS COCO, Flick 30K | BLEU, METEOR, ROUGE, CIDEr |
2020, Pan et al. [42] | CNN + LSTM | MS COCO | BLEU, METEOR, ROUGE, CIDEr, SPICE |
2020, Yang et al. [63] | CNN + LSTM | MS COCO, Flick 30K | BLEU, METEOR, ROUGE, CIDEr, SPICE |
2021, Zhong et al. [64] | CNN + LSTM | MS COCO, Flick 30K | BLEU, METEOR, ROUGE, CIDEr, SPICE |
2021, Tian et al. [65] | CNN + LSTM | MS COCO, Flick 30K | BLEU, METEOR, ROUGE, CIDEr, SPICE |
2022, Klein et al. [43] | CNN + LSTM | MS COCO | BLEU, METEOR, ROUGE, CIDEr |
2022, Deng et al. [66] | CNN + LSTM | Visual Genome | mRank |
2022, Fei [67] | CNN + LSTM | MS COCO | BLEU, METEOR, ROUGE, CIDEr, SPICE |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
López-Sánchez, M.; Hernández-Ocaña, B.; Chávez-Bosquez, O.; Hernández-Torruco, J. Supervised Deep Learning Techniques for Image Description: A Systematic Review. Entropy 2023, 25, 553. https://doi.org/10.3390/e25040553