A Systematic Literature Review on Image Captioning

Abstract: Natural language problems have been investigated for around five years. Recent progress in artificial intelligence (AI) has greatly improved the performance of models. However, the results are still not sufficiently satisfying. Machines cannot imitate human brains and the way they communicate, so it remains an ongoing task. Due to the increasing amount of information on this topic, it is very difficult to keep track of the newest research and results achieved in the image captioning field. In this study, a comprehensive Systematic Literature Review (SLR) provides a brief overview of improvements in image captioning over the last four years. The main focus of the paper is to explain the most common techniques and the biggest challenges in image captioning and to summarize the results from the newest papers. Inconsistent comparison of results achieved in image captioning was noticed during this study, and hence this paper raises awareness of incomplete data collection. It is therefore very important to compare the results of a newly created model with the newest information and not only with the state-of-the-art methods. This SLR is a source of such information for researchers, so that they can compare results accurately before publishing new achievements in the image caption generation field.


Introduction
Ever since researchers started working on object recognition in images, it has been clear that providing only the names of the recognized objects does not make as good an impression as a full human-like description. As long as machines do not think, talk, and behave like humans, natural language descriptions will remain a challenge to be solved. There have been many variations and combinations of different techniques since 2014, when neural networks were first applied to image captioning in ref. [1]. Four successful articles [2][3][4][5], which are now the most cited articles researchers rely on, were published in 2015. There was not much interest in this area in 2014 and 2015, but this review shows how quickly its popularity is growing: 57 of the articles found were published in 2017-2018, and 17 had already been published during the first three months of 2019. The advantages and the future of human-like technologies are undeniable, from enabling computers to interact with humans to specific applications for child education, health assistants for the elderly or visually impaired, and many more. With so many opportunities for meaningful applications in society, it is not surprising that many studies have already tried to obtain more accurate descriptions and make machines think like humans. However, machines still lack the natural way of human communication, and this continues to be a challenging task to tackle. Our work is meant to summarize the newest articles and to give insight into the latest achievements and the highest results, to ease the work of new researchers who would like to direct their efforts toward building better methods. This paper is a systematic literature review (SLR) of the newest articles, intended to provide a summarized understanding of what has been achieved in this field so far and which techniques have performed best. Special attention was given to result collection and year-to-year comparison.
We hope this work will help future researchers to find newer and more innovative ways to achieve better results. The rest of the paper is divided into four parts. First, we present the research methods used to conduct this SLR. Second, we introduce readers to summary tables of all the articles and the results achieved in them. Third, the discussion section introduces readers to the most popular methodologies and innovative solutions in image captioning. Finally, the paper concludes with some open questions for future studies.

SLR Methodology
The SLR has become a great help in today's dynamic, data-driven world, with its massive growth in data volume. It is sometimes very difficult to absorb all existing information before starting to delve into a specific field. Image captioning is, as already mentioned, a task with considerable practical significance, and we found that there is a large body of literature that is hard to summarize, which makes it hard to stay up to date with the newest achievements. Only a few SLRs have been conducted for image captioning so far [6][7][8][9], and given the fast progress and increasing popularity of this field, we find it necessary that they continue to be undertaken. Moreover, the results of image captioning models in previous reviews were not as detailed as they are in this paper. We dedicated time to a detailed study of most articles in image captioning: the digital libraries that store most of the articles were identified, search questions were carefully formulated, all articles found were closely analyzed, and the results are presented together with the important challenges identified during the review process. This work follows ref. [6] as a guideline because of its easily understandable structure and similar ideas.

Search Sources
Digital libraries are today the most suitable platforms for searching books, journals, and articles. In this literature review we chose three digital libraries due to limited resources and the huge number of articles on this topic. However, these libraries cover a significant amount of the relevant literature for our study. The following three digital libraries were used for the search:
1. ArXiv
2. IEEE Xplore
3. Web of Science-WOS (previously known as Web of Knowledge)
Much research has been done in the field of image captioning, so we narrowed down the literature review by searching for articles only from the last four years, from 2016 to 2019. During the search in each digital library, we kept only articles posted under the computer science topic.

Search Questions
It is very important to have clear questions which need to be answered after the whole literature has been reviewed. The results retrieved after each query must be precise, without too much noise and without unnecessary articles, so the questions were carefully formulated after many attempts. In this paper we answer four questions:

1. What techniques have been used in image caption generation?
2. What are the challenges in image captioning?
3. How does the inclusion of novel image description and addition of semantics improve the performance of image captioning?
4. What is the newest research on image captioning?
The questions were selected to fully cover the main objectives of this paper: to present the main techniques used in image captioning in the past four years and to identify the main challenges researchers have faced. Furthermore, we aimed to summarize results from the newest published papers to allow a fair comparison with upcoming papers, so we included a generic query for image captioning but restricted it to articles from 2019. A general image captioning query would otherwise be too broad, and since our focus is on introducing readers to the newest achievements, we needed to read only the most recent articles. Good results achieved in earlier years are not covered by this generic query; they fall under questions 1-3, which cover the years 2016-2019. New research needs to be compared with the best results achieved in image captioning, which might be hard to find due to the large number of articles in this area and the low visibility of the less cited ones.

Search Query
To become acquainted with the "image caption generation" topic, we first conducted a quick review of articles on it. This gave us an idea of the technologies and models that are popular in this area, so that our research would be relevant and correct. Moreover, we did not narrow the search query to small details, in order to obtain enough results and an appropriate number of articles from the search; keywords are presented exactly as they were submitted for the search query. The query questions, together with the number of articles found in each library, are presented in Tables 1-4. Libraries were searched in the order in which they are listed in the tables: first ArXiv, then IEEE Xplore, and finally WOS. If an article had already been reviewed from a previous library or a previous query, it was not added to the total number of relevant articles but was indicated in brackets. The WOS digital library usually returned the largest number of results, though with the smallest percentage of articles relevant to the topic of interest. In our experience during this study, ArXiv was the most precise and had the best ratio of relevant to other articles.
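The counting rule described above (libraries processed in a fixed order, repeated articles reported separately in brackets) can be sketched as follows. This is a minimal illustration, not the procedure actually used in the review: the function name, the title-based matching, and the example titles are all assumptions.

```python
def count_new_articles(results_by_library, order):
    """Count new and repeated relevant articles per library, in search order.

    Matching by lower-cased title is an assumption made for this sketch;
    the review does not state how repeated articles were recognized.
    """
    seen = set()
    summary = {}
    for library in order:
        new, repeated = 0, 0
        for title in results_by_library.get(library, []):
            key = title.strip().lower()
            if key in seen:
                repeated += 1      # would be reported in brackets
            else:
                seen.add(key)
                new += 1           # adds to the total of relevant articles
        summary[library] = (new, repeated)
    return summary


# Hypothetical search results for a single query.
results = {
    "ArXiv": ["Paper A", "Paper B"],
    "IEEE Xplore": ["paper a", "Paper C"],
    "WOS": ["Paper B", "Paper C", "Paper D"],
}
summary = count_new_articles(results, ["ArXiv", "IEEE Xplore", "WOS"])
```

To carry the rule across several queries as well, the `seen` set would have to be kept between calls rather than recreated each time.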

Results
After reading all the articles, and inspired by another SLR [6], we gained a good understanding of the key aspects of image captioning. To present the summarized results in a convenient way, a comprehensive comparison table (Table A1 in Appendix A) of all articles found was made, listing the methods used together with the results on the datasets most often used for testing. The structure is as follows. Under each column representing one aspect, an x was written if this aspect appeared in the article. The following columns represent evaluation metric results on two datasets, MS COCO and Flickr30k. If no testing was performed on one of the two selected datasets, the cells were left empty. If a different dataset or evaluation metric was used in the article, a short note was provided; for example, ref. [10] used the Lifelog dataset for model training and evaluation, ref. [20] used the Visual Genome dataset, ref. [39] evaluated results using the F-1 score, and ref. [47] used R metrics. To make the results easier to understand, we first present the five articles from each year that achieved the best results (Tables 5-8). Only three articles from 2016 were evaluated on MS COCO, so only those are presented. The other tables contain the five articles with the highest CIDEr results on MS COCO.
The distribution of each year's results based on the six main metrics is presented in Figures 1-6. The figures do not show how the results changed within a year, yet we can still identify inconsistency from one year to another. For example, results achieved in 2018 are in many cases lower than those achieved in 2017 across all metrics. Moreover, Tables 7 and 8 show that some results from 2018 were higher than results from 2019, which confirms the assumption of this study about the difficulty of keeping up with the newest articles. The highest CIDEr result on MS COCO among all articles found during this SLR was reached in 2019 in ref. [87], but it was only 0.2 higher than the 2018 result in ref. [70]. None of the papers published in 2019 included this result in the comparison with their own results. In most papers, models were compared with state-of-the-art methods and were therefore stated to have achieved better results, while much higher results already existed in other papers.
Tables 9 and 10 present results based on the different techniques used in image captioning: which combinations of encoder and decoder were used each year and which methods were the most popular. These tables help to understand which techniques work best together, and which combinations have probably not been successful or have not been explored at all so far.
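The two checks described in this section, selecting the top results per year by CIDEr and spotting earlier results that a newer paper should have compared against, can be sketched in a few lines. The records below are illustrative placeholders and not values from the reviewed articles; the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (reference label, year, CIDEr on MS COCO).
# These numbers are placeholders, not results taken from the reviewed papers.
records = [
    ("A", 2017, 1.02), ("B", 2017, 1.10), ("C", 2018, 1.25),
    ("D", 2018, 0.98), ("E", 2019, 1.12), ("F", 2019, 1.27),
]


def top_by_year(rows, k=5):
    """Group rows by year and keep the k highest CIDEr scores in each year."""
    by_year = defaultdict(list)
    for label, year, cider in rows:
        by_year[year].append((cider, label))
    return {y: sorted(v, reverse=True)[:k] for y, v in sorted(by_year.items())}


def overlooked_baselines(rows, year):
    """Return earlier results that beat at least one result published in `year`."""
    current = [cider for _, y, cider in rows if y == year]
    if not current:
        return []
    weakest = min(current)
    return sorted(label for label, y, cider in rows if y < year and cider > weakest)


best = top_by_year(records, k=1)
stale = overlooked_baselines(records, 2019)   # ["C"] for the placeholder data
```

With real data from Table A1, a non-empty `overlooked_baselines` result would point to exactly the kind of inconsistency discussed above: a newer paper reporting an improvement while an earlier, higher result exists.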

Discussion
In this section we discuss the key aspects of all the papers reviewed during the SLR. We also present new ideas which could possibly lead to better image captioning performance. Each aspect is explained in a separate subsection.

Model Architecture and Computational Resources
Most of the models rely on the widespread encoder-decoder framework, which is flexible and effective. It is sometimes described as a CNN + RNN structure: usually a convolutional neural network (CNN) serves as the encoder and a recurrent neural network (RNN) as the decoder. The encoder "reads" the image: given an input image, it extracts a high-level feature representation. The decoder generates words: given the image representation from the encoder (the encoded image), it produces a full, grammatically and stylistically correct sentence that describes the image.
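The data flow between the two components can be illustrated with a deliberately simplified sketch. Nothing here is a trained model: `encode` replaces the CNN with row averages, and `toy_step` replaces the RNN with a fixed rule; both are hypothetical stand-ins chosen only to make the interface between encoder and decoder visible.

```python
def encode(image):
    """Stand-in encoder: one mean value per pixel row instead of CNN features."""
    return [sum(row) / len(row) for row in image]


def decode(features, step, max_len=10, end="<end>"):
    """Greedy decoding loop: ask `step` for the next word until it emits `end`."""
    words = []
    for _ in range(max_len):
        word = step(features, words)
        if word == end:
            break
        words.append(word)
    return words


def toy_step(features, previous):
    """Hypothetical decoder step: a fixed rule in place of a trained RNN."""
    adjective = "bright" if max(features) > 0.5 else "dark"
    script = ["a", adjective, "image", "<end>"]
    return script[len(previous)]


# A 2x2 grayscale "image" with pixel values in [0, 1].
caption = " ".join(decode(encode([[0.9, 0.8], [0.1, 0.2]]), toy_step))
```

In a real captioner the `step` function would be a recurrent network that receives the encoded image and the words generated so far, but the loop structure stays the same.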

Encoder-CNN
As there is usually only one encoder in the model, performance relies heavily on the CNN deployed. Even though we identified five convolutional networks in our research, two stand out and were used the most. The first popular choice of feature extractor is VGGNet, preferred for its simplicity and power. During this study it was found that VGG was used in 33 of the 78 reviewed articles. However, the same number of articles used ResNet as the encoder. ResNet stands out for being the most computationally efficient of all the convolutional networks considered. In ref. [88] a clear comparison of four networks (AlexNet, VGGNet, ResNet, and GoogleNet, also called Inception-X Net) was made; the results are presented in Table 11. It is clear from Table 11 that ResNet performs best in both Top-1 and Top-5 accuracy. It also has far fewer parameters than VGG, which saves computational resources. However, being easy to implement, VGG remains popular among researchers and has the second-highest result according to the review in ref. [88]. The newest research mostly prioritizes simplicity and speed at a slight cost in performance. It is up to the researcher to decide whether they need more precise results, a more efficient model, or more simplicity.
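For readers unfamiliar with the two accuracy figures used in such comparisons, the following sketch shows how Top-k accuracy is commonly computed: a prediction counts as correct if the true class is among the k highest-scoring classes. The scores and labels are made-up examples, not data from ref. [88].

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Made-up class scores for three samples and three classes.
scores = [
    [0.1, 0.7, 0.2],   # highest score: class 1
    [0.5, 0.3, 0.2],   # highest score: class 0
    [0.2, 0.3, 0.5],   # highest score: class 2
]
labels = [1, 1, 0]

top1 = top_k_accuracy(scores, labels, k=1)   # 1 of 3 samples correct
top2 = top_k_accuracy(scores, labels, k=2)   # 2 of 3 samples correct
```

Top-5 accuracy is the same computation with `k=5` and is therefore never lower than Top-1 accuracy on the same predictions.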

Decoder-LSTM
LSTM (long short-term memory) was developed from the RNN with the intention of working with sequential data. It is now considered the most popular method for image captioning due to its effectiveness in memorizing long-term dependencies through a memory cell. This comes at a cost: it requires a lot of storage and is complex to build and maintain. There have been attempts to replace it with a CNN [52], but as the number of articles using this method shows (68 of the 78 found during this SLR), scientists keep coming back to LSTM. LSTM generates a caption by producing one word at every time step, conditioned on a context vector, the previous hidden state, and the previously generated words.
Computational speed depends not only on the feature detection model but also on the size of the vocabulary: each new word added consumes more time. Just recently, scientists tried to solve the image captioning task with a reduced vocabulary [73]. Usually the vocabulary size varies from 10,000 to 40,000 words, while their model relies on 258 words. The decrease is quite sharp (about 39 times smaller than a vocabulary of 10,000 words), yet the results are high, with some room for improvement.
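The memory cell mentioned above can be made concrete with the standard LSTM update equations, written here for scalar inputs and states to keep the sketch short. This is the textbook formulation, not the implementation of any reviewed article; the weight layout `w[gate] = (input weight, recurrent weight, bias)` is an assumption of this example. In a captioning model, `x` would typically combine the embedding of the previous word with the context vector.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a scalar LSTM cell.

    `w[gate]` holds (input weight, recurrent weight, bias) for each gate;
    this layout is chosen for the sketch only.
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    i = sigmoid(pre("input"))        # how much new information to write
    f = sigmoid(pre("forget"))       # how much of the old memory to keep
    o = sigmoid(pre("output"))       # how much of the memory to expose
    g = math.tanh(pre("candidate"))  # candidate memory content
    c = f * c_prev + i * g           # updated memory cell
    h = o * math.tanh(c)             # updated hidden state
    return h, c


# With all weights at zero, every gate evaluates to 0.5 and the candidate to 0.
zero = {gate: (0.0, 0.0, 0.0) for gate in ("input", "forget", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, w=zero)
```

The forget gate `f` is what lets the cell carry information across many time steps: when it stays close to 1, `c_prev` passes through almost unchanged.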
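The size of the reduction can be checked with simple arithmetic. The second function estimates the number of parameters in a final word-prediction layer, under the assumption (not taken from ref. [73]) that this layer is a single dense layer over a hidden state of 512 units.

```python
def reduction_factor(baseline_vocab, reduced_vocab):
    """How many times smaller the reduced vocabulary is than the baseline."""
    return baseline_vocab / reduced_vocab


def output_layer_parameters(hidden_size, vocab_size):
    """Weights plus biases of a dense layer mapping a hidden state to word scores."""
    return hidden_size * vocab_size + vocab_size


low = reduction_factor(10_000, 258)    # about 38.76
high = reduction_factor(40_000, 258)   # about 155.04

# Hidden size 512 is an illustrative assumption, not a value from ref. [73].
full = output_layer_parameters(512, 10_000)   # 5,130,000 parameters
small = output_layer_parameters(512, 258)     # 132,354 parameters
```

Because the parameter count of such a layer is proportional to the vocabulary size, the layer shrinks by the same factor as the vocabulary itself.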

Attention Mechanism
The attention model was established with the intention of replicating natural human behavior: before summarizing an image, people tend to pay attention to specific regions of that image and then form a good explanation of the relationships between the objects in those regions. The same approach is used in the attention model. Researchers have tried to reproduce this behavior in several ways, which are widely known as hard and soft attention mechanisms [5]. Other scientists have distinguished between top-down and bottom-up attention models. Ref. [89] recently confirmed that top-down attention is still the better approach, as experiments with humans and with machines showed similar results. In the top-down model, the process starts from a given image as input and then converts it into words. Moreover, ref. [89] created a new multi-modal dataset with the highest number of new instances from human fixations and scene descriptions.
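The difference between soft and hard attention can be sketched as follows, in the commonly used formulation: soft attention averages all region features using softmax weights, while hard attention samples a single region according to those weights. The region features and relevance scores are made-up inputs; in a real model the scores would be computed from the decoder state.

```python
import math
import random


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(regions, scores):
    """Context vector as the attention-weighted average of all region features."""
    weights = softmax(scores)
    dim = len(regions[0])
    context = [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
    return context, weights


def hard_attention(regions, scores, rng):
    """Context vector as the features of one region sampled from the weights."""
    weights = softmax(scores)
    index = rng.choices(range(len(regions)), weights=weights, k=1)[0]
    return regions[index], index


# Two made-up image regions with two features each, and equal relevance scores.
regions = [[1.0, 0.0], [0.0, 1.0]]
context, weights = soft_attention(regions, [0.0, 0.0])
picked, index = hard_attention(regions, [0.0, 0.0], random.Random(0))
```

Soft attention is differentiable and can be trained with ordinary backpropagation, whereas the sampling step in hard attention requires a different training procedure; this is one reason both variants are treated separately in the literature.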

Datasets
Most of the works are evaluated on the Flickr30k [90] and MS COCO [91] datasets. Both datasets contain a large number of images, and each image has five captions assigned, which makes them very suitable for training and testing models. It is of course necessary to keep comparing models on the same datasets in order to check performance; however, these datasets are very limited in the object classes and scenarios they present. The need for new datasets has always been an open question in image captioning. Ref. [92] proposed a method for gathering large datasets of images from the internet, which might help to replace the MS COCO or Flickr datasets used in most previous research. Several other datasets have been used for model evaluation, such as the Lifelog dataset [10], the Visual Genome dataset [20,36], IAPRTC-12 [45], and the OpenImages and Visual Relationship Detection datasets [36], but these were isolated cases.
Recently, interest in novel image scenarios has grown, which has increased the demand for newer datasets even more. Ref. [93] introduced the first rigorous, large-scale dataset for novel object captioning, which contains more than 500 novel object classes. Another realistic dataset was introduced in ref. [94]. It contains news images and their actual captions, along with the associated news articles, news categories, and keyword labels. Moreover, social networks are clearly highly integrated into people's lifestyles. More and more images are appearing on social media, especially from the younger generation, so it is important to analyze this data as well: it offers the most natural backgrounds and the newest trends for machines to interpret, learn from, and improve on. Ref. [95] proposed a novel deep feature learning paradigm based on social collective intelligence, which can be acquired from the inexhaustible social multimedia content on the Web, particularly social images and tags; however, to our knowledge this work was not continued.

Human-Like Feeling
In the last year, two keywords have entered the vocabulary of almost every article written on image captioning: novel and semantics. These keywords are important for solving the biggest challenge in this task, namely generating captions that are indistinguishable from human-written ones. Semantics implementation [49] is intended to provide a clean way of injecting sentiment into the current image captioning system. Novel objects must be included to expand the range of scenarios. There have been several insights into why this is still an open issue. First, models are usually built on very specific datasets, which do not cover all possible scenarios and are not applicable to describing diverse environments. The same holds for the vocabulary, as it has a limited number of words and word combinations. Second, models are usually built to perform one specific task, while humans are able to work on many tasks simultaneously. Ref. [35] tried to overcome this problem and provided a solution, although it was not continued further. Another promising approach for dealing with unseen data, since it is currently impossible to feed all existing data into a machine, was proposed in refs. [56,96]. Lifelong learning is based on a questioning approach, i.e., holding a discussion directly with the user or inside the model. This approach relies on a natural form of human communication: from early childhood, children mostly learn by asking questions. The model is likewise intended to learn like a child, by asking specific questions and learning from the answers. This method falls under the question answering topic; an in-depth literature review could be done on that topic, as here we have presented only what appeared during this study on image captioning. It can be treated as a separate problem, but it also has a great impact on image captioning.

Comparison of Results
This study found many articles in which the results of the proposed models were compared with state-of-the-art models such as refs. [2][3][4][5]. As these models were built some years ago, they have been cited more and so are easier to find during a search of the digital libraries. For example, ref. [5] has been cited 2855 times according to Google Scholar, while most of the newest articles found have not yet been cited at all, and those written in 2018 have usually been cited fewer than 10 times. Not surprisingly, the newer the articles are, the further down the search results they appear, so most researchers might not even find them if not enough time is dedicated to a literature review. Figures 1-6 confirm that results are not steadily increasing: many results are not higher than those from a year earlier. This may be due to the difficulty of the topic, but a lack of detailed information can also lower researchers' goals: without knowing that higher results already exist, they do not aim to improve on them, even though comparing one's results with similar approaches is a very important part of research. In this study the results from the newest models are presented so that upcoming researchers can compare their models against the newest achievements. We hope this research will help future researchers to save time on detailed literature reviews and to keep in mind the importance of checking for the newest articles.

Conclusions
Image captioning is a very exciting task and gives rise to tough competition among researchers. More and more scientists are deciding to explore this field, so the amount of information is constantly increasing. It was noticed that results are usually compared with quite old articles, although there are dozens of new ones with even higher results and new ideas for improvements. Comparison with older articles gives a misleading picture of the real improvement in results: usually much higher results have already been achieved but are not included in the paper. New ideas can also very easily be lost if they are not looked for carefully. In order to prevent good ideas from being lost and to encourage fair competition among newly created models, this systematic literature review summarizes all the newest articles and their results in one place. Moreover, it is still not clear whether the MS COCO and Flickr30k datasets are enough for model evaluation and whether they serve sufficiently well when diverse environments are considered. The amount of data will never stop increasing and new information will keep appearing, so future studies should consider whether static models are good enough for long-term application or whether lifelong learning should be given more attention. We hope this SLR will serve other scientists as a guideline and as an encouragement to collect the newest information for their research evaluation.
Author Contributions: All authors contributed to designing and performing measurements, data analysis, scientific discussions, and writing the article.
Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.