Text Augmentation Using BERT for Image Captioning

Abstract: Image captioning is an important task for improving human-computer interaction as well as for gaining a deeper understanding of the mechanisms underlying image description by humans. In recent years, this research field has developed rapidly, and a number of impressive results have been achieved. Typical models are based on neural networks, including convolutional networks for encoding images and recurrent networks for decoding them into text. Furthermore, attention mechanisms and transformers are actively used to boost performance. However, even the best models are limited in quality when data are scarce: generating a variety of descriptions of objects in different situations requires a large training set. The commonly used datasets, although rather large in terms of the number of images, are quite small in terms of the number of different captions per image. We expanded the training dataset using text augmentation methods. The methods include augmentation with synonyms as a baseline and the state-of-the-art language model called Bidirectional Encoder Representations from Transformers (BERT). As a result, models trained on the augmented datasets show better results than the same models trained on the dataset without augmentation.


Introduction
Image captioning is the task of automatically generating a textual description of an image [1]. The goal pursued by researchers is to make these textual descriptions as similar as possible to how a human would describe an image. Systems with such generation capability can be used to help visually impaired people, to improve human-computer interaction by introducing visual concepts to a computer, and to create better features for information retrieval using images [2,3].
In recent years, most approaches to solving the image captioning task have involved neural networks. Most of the architectures used are of an encoder-decoder type, for example [4][5][6][7][8]. In such models, an image is first encoded into a hidden representation, and then a textual description of the image is generated (decoded) based on that hidden representation. Convolutional neural networks, such as VGG [9] and ResNet [10], are most often used as encoders, because they have proven themselves in a variety of computer vision tasks. Recurrent neural networks, such as the RNN [11] or LSTM [12], are used as decoders due to their wide applicability to natural language processing tasks.
However, training and using recurrent neural networks is quite challenging. To preserve information from previous steps, an RNN uses a hidden state whose task is to summarize and store this information in a single vector. Nevertheless, this vector usually has a rather small dimension and, as a result, a limited capacity to remember information. In addition, as the length of the sequence grows, the complexity of training a model to generate a sequence of words also increases.
Such methods as attention, introduced in [13], and transformers, described in [14], have proven their efficiency in solving a large set of NLP tasks, particularly sequence-to-sequence tasks, and have also been successfully applied to image captioning.
The main contributions of this work are the following:
• we proposed the use of augmentation of image captions in a dataset (including augmentation using BERT) to improve solutions of the image captioning problem; to the best of our knowledge, this is the first attempt to improve quality in image captioning through caption augmentation;
• we compared different augmentation methods (using synonyms and using BERT), as well as various increases in dataset size, with respect to the training of three state-of-the-art image captioning models and their final quality;
• extensive experiments showed that the proposed augmentation methods improve model performance on commonly used image captioning metrics, such as CIDEr and SPICE; and,
• the proposed methods are not limited to the image captioning task and can be used for other vision-language tasks.

Image Captioning
Image captioning is a difficult task at the intersection of computer vision (CV) and natural language processing (NLP), which involves the generation of a short sentence describing an image [1].
Encoder-decoder architectures, which show decent results, are commonly used for this task. Generating captions is a sequence generation task. Moreover, images can be treated as a sort of "visual language", so the image captioning task can be considered a machine translation task from a "visual language" to a human language. For this reason, many methods that have proven themselves in machine translation have also been successfully adapted to the image captioning problem.
One of the most widely used methods, which significantly improved the quality of basic models, is the attention mechanism [13]. A special case of attention, the visual attention mechanism, is widely used in current state-of-the-art image captioning models. The main idea of visual attention is to allow models to selectively concentrate on objects of interest.
Another approach is to take advantage of transformer-based architectures [14]. They are the state-of-the-art methods in sequence modeling tasks such as machine translation [25] and language modeling [24]. A number of image captioning models use transformer architectures in order to improve their quality, e.g., [26][27][28]. The great advantage of transformer models over other sequence generation models is that transformers use a fully connected neural network instead of a recurrent one. This simplifies model training and increases the ability of the model to take context into account. Recent studies have shown that such approaches can also be applied to the image captioning task.

Augmentation
Augmentation is a method of creating additional training data from an existing dataset. There are many different augmentation techniques for data of different natures. Augmentation has shown itself well in the main tasks associated with the analysis of structured data, such as images [29] and text [30][31][32]. Augmentation methods usually depend on the specifics of the task; however, there are some general approaches. For example, for image augmentation [29], horizontal flips, random crops, and more are used; various combinations of these methods may also be applied. Small changes to an image do not affect its class or content, which allows the training set to be expanded with such manipulations, as illustrated by the sketch below. For text augmentation, random word deletions and insertions, synonym replacements, and word permutations within a sentence are usually used [30].
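As an illustration, here is a minimal sketch of a typical image augmentation pipeline, assuming the torchvision library; the specific transforms and parameter values are our choice for illustration, not taken from the papers cited above:

```python
from torchvision import transforms

# A common image augmentation pipeline: small, label-preserving changes.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),   # horizontal flip
    transforms.RandomCrop(224, padding=8),    # random crop with padding
    transforms.ColorJitter(brightness=0.2),   # slight photometric noise
    transforms.ToTensor(),
])
# Each epoch then sees a slightly different version of every training image.
```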
However, augmentation is rarely used in vision-language tasks, such as image captioning, visual question answering, or visual dialog. Most works, such as [33,34], use the standard image augmentation methods described above. Others use text augmentation techniques. For example, in [35], simple augmentations, such as word permutations in captions or random word replacement, are used to design better evaluation metrics. However, there are several works on more complex augmentation methods for vision-language tasks. For example, in [36], the authors use a template-based generation method that relies on image annotations and an LSTM-based language model to generate question-answer pairs about images.

BERT
Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art language model [24]. The idea behind BERT training is to teach the model to predict missing words in a sentence. To do this, some of the words in the sentence are replaced with a special token (MASK), and the task of the model is to predict those words from their context. Furthermore, to increase model quality and to teach the network to understand the relationship between sentences, it is simultaneously trained to predict whether one sentence is a logical continuation of another. As a result, BERT is pre-trained on a large set of texts in an unsupervised manner, yielding high-quality vector representations of words and a model that can predict words from context and detect connections between sentences. In addition, thanks to its architecture, BERT can be easily fine-tuned and adapted to specific NLP tasks. All of these features have made it an essential part of almost all current state-of-the-art NLP models.
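To make the masked-word prediction concrete, here is a minimal sketch using the HuggingFace transformers fill-mask pipeline with a pre-trained BERT; the example sentence and the model name are illustrative assumptions:

```python
from transformers import pipeline

# A pre-trained BERT predicts the masked word from its context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill_mask("a man riding a [MASK] down the street."):
    # each candidate carries the predicted token and its probability
    print(candidate["token_str"], round(candidate["score"], 3))
# plausible completions: "bike", "horse", "motorcycle", ...
```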

Methodology
In this section, we describe the main methods of text augmentation. In particular, we concentrate on the methods used in our research. In Section 3.1, we describe caption augmentation using synonym replacement, which we used as a baseline method for our studies. In Section 3.2, we concentrate on augmentation using language models, especially contextualized word embeddings and BERT.

Synonymous Augmentation
Unlike in image and speech processing, augmentation by adding random noise to the input signal (characters) is not suitable for text. The relative order of letters and their presence in a word can significantly affect the semantic meaning of the word itself. The best method of text augmentation would be to rephrase a sentence as a person would; however, this approach is impractical due to the size of training datasets. One of the simpler but still effective methods is synonym replacement augmentation, first introduced in [30].
Let $I$ be an image from a training set and $C = \{c_1, \ldots, c_k\}$ be the set of captions corresponding to that image, where each caption is a sequence of words $c_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,l_i})$ and $l_i$ is the length of the $i$-th caption. Additionally, fix some synonym thesaurus $T$ and let $T(w_{i,j}) = (s_{i,j,1}, s_{i,j,2}, \ldots, s_{i,j,m_{i,j}})$ be the list of synonyms of a word $w_{i,j}$, sorted in descending order of semantic closeness to the most frequently seen meaning of $w_{i,j}$; here $m_{i,j}$ is the number of synonyms of the word $w_{i,j}$ in the thesaurus $T$.
The following operation is performed in order to generate a new caption $c'_i$ based on $c_i$. Fix some probability $p$. Every word $w_{i,j} \in c_i$ that has synonyms in the thesaurus, i.e., for which $m_{i,j} \geq 1$, is replaced with a synonym with probability $p$. To choose which of the synonyms it will be replaced with, fix another probability $q$. With probability $q$, the word is replaced with the most semantically close synonym $s_{i,j,1}$. If the word was not replaced by the first synonym, it is replaced with the second closest one, $s_{i,j,2}$, with probability $q$, and so on. Thus, the probability of replacing a word with the synonym $s_{i,j,r}$ is proportional to $q^r$, and it decreases exponentially as the semantic similarity of the synonym to the original word decreases. This replacement occurs independently for each word in a sentence.
The above operation of obtaining a new caption based on an existing one is performed $d$ times, where $d$ is called the augmentation coefficient. Accordingly, if an image has $k$ captions, then after applying augmentation with coefficient $d$ it will have $kd$ captions, i.e., the training set is increased $d$ times.
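Below is a minimal sketch of this procedure, assuming WordNet (via NLTK) as the thesaurus $T$; the thesaurus choice and helper names are ours for illustration, not prescribed by the method:

```python
import random
from nltk.corpus import wordnet  # requires: nltk.download('wordnet')

def synonyms(word):
    """Synonyms of `word` from WordNet, ordered from the most common sense."""
    seen, result = {word}, []
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name not in seen:
                seen.add(name)
                result.append(name)
    return result

def synonym_augment(caption, p=0.1, q=0.5):
    """Each word is replaced with probability p; the r-th closest synonym
    is chosen with geometrically decaying probability governed by q."""
    out = []
    for word in caption.split():
        syns = synonyms(word)
        if syns and random.random() < p:
            r = 0
            while r < len(syns) - 1 and random.random() >= q:
                r += 1  # move on to the next closest synonym
            out.append(syns[r])
        else:
            out.append(word)
    return " ".join(out)

def augment_captions(captions, d=2, **kwargs):
    """Apply the operation d times per caption (augmentation coefficient d)."""
    return [synonym_augment(c, **kwargs) for c in captions for _ in range(d)]
```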

Contextualized Word Embeddings Augmentation
For augmentation with contextualized word embeddings, an approach similar to those described in [31,32] can be used. As with synonym replacement, let there be, for some image $I$, a set of sentences $C = \{c_1, \ldots, c_k\}$ describing that image, where each sentence is a word sequence $c_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,l_i})$. For the purpose of augmentation, fix a language model $LM$ that can predict the probability that a particular word $w$ will occur in a certain context. More formally, consider some caption $c_i$ and the $j$-th word of this caption, and let its context be the entire caption except the word itself. Then $LM(c_i, j)$ is a probability distribution over the words that can stand in place $j$ of the caption $c_i$, taking its context into account.
The following procedure is used to obtain a new caption $c'_i$ based on an existing caption $c_i$ using the language model. Fix the probability $p$ with which each particular word of the caption should be replaced by another word. In order to replace the word $w_{i,j}$, calculate $LM(c_i, j)$, then sample a word $w'_{i,j} \sim LM(c_i, j)$ and take it as the next word of the new caption $c'_i$. Repeating this procedure for each word $w_{i,j}$ creates an augmented caption. Performing this sentence augmentation operation $d$ times for each of the captions yields $kd$ sentences describing the corresponding image.
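Here is a minimal sketch of this procedure with BERT playing the role of $LM$, assuming the HuggingFace transformers and PyTorch libraries; subword and whole-word handling are simplified for clarity:

```python
import random
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def bert_augment(caption, p=0.1):
    """Replace each word with probability p by a word sampled from LM(c_i, j)."""
    words = caption.split()
    for j in range(len(words)):
        if random.random() >= p:
            continue
        masked = words.copy()
        masked[j] = tokenizer.mask_token                  # hide the j-th word
        enc = tokenizer(" ".join(masked), return_tensors="pt")
        mask_pos = (enc.input_ids[0] == tokenizer.mask_token_id).nonzero()[0]
        with torch.no_grad():
            logits = model(**enc).logits[0, mask_pos]     # distribution over vocab
        probs = torch.softmax(logits, dim=-1).squeeze(0)
        token_id = torch.multinomial(probs, 1).item()     # sample, not argmax
        words[j] = tokenizer.decode([token_id]).strip()
    return " ".join(words)
```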

Dataset
We used MSCOCO [15], the largest and most widely used dataset for image captioning, as the base dataset for performing augmentation in order to compare the effectiveness of the augmentation methods. Its standard version consists of 82,783 training images and 40,504 validation images, with five different captions for each image. For offline evaluation, we used the standard Karpathy split from [6], which is used by most articles for comparing results. As a result, the final dataset consists of 113,287 images for training, 5000 images for validation, and 5000 images for testing.
After dataset augmentation, we also performed postprocessing by replacing all words that occurred fewer than five times in the final dataset with a special token <UNK>. Moreover, because the vast majority of captions were no more than 16 words long, we truncated captions to maintain a maximal caption length of 16.
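A minimal sketch of this postprocessing step; the thresholds come from the text, while the function and constant names are ours:

```python
from collections import Counter

UNK, MIN_COUNT, MAX_LEN = "<UNK>", 5, 16

def postprocess(captions):
    """Replace rare words with <UNK> and truncate captions to MAX_LEN words."""
    counts = Counter(word for caption in captions for word in caption.split())
    processed = []
    for caption in captions:
        words = [w if counts[w] >= MIN_COUNT else UNK for w in caption.split()]
        processed.append(" ".join(words[:MAX_LEN]))
    return processed
```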

Implementation Details
We compared five different augmentation techniques for the original dataset. Augmentation was only applied to the training part of the dataset. In one of the options, we did not augment the dataset at all and used this result as a baseline for comparison. In the other options, we augmented the dataset using BERT with the augmentation coefficient d equal to 2 and 3. We also performed augmentation using synonyms with an augmentation coefficient equal to 2. The default value of the replacement rate p was 0.1 in all cases except in the studies of the replacement rate's influence.
The effect of the described augmentations was compared while training the model from [40], one of the state-of-the-art models with open source code. We conducted extensive experiments to choose the best augmentation method for this model. Models on all datasets were trained for 12 epochs in a regular way and then for seven epochs in the self-critical way described in [4]. For caption generation during the testing phase, the beam search algorithm with a beam size of 5 was used.
Additionally, we confirmed our results on other state-of-the-art models, namely [41,42], using the best variant of the dataset chosen based on the previous experiments with [40].
For all models, we used the open source code released by the authors of the corresponding papers: https://github.com/aimagelab/meshed-memory-transformer for [40], https://github.com/husthuaan/AoANet for [41], and https://github.com/JDAI-CV/image-captioning for [42]. For augmentation, the nlpaug library [43] was used. The models were trained and tested using the Google Cloud Platform on a cloud machine with 8 CPU cores, 30 GB of RAM, and two Tesla K80 GPUs.
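As an illustration of how such augmentation can be invoked through nlpaug, here is a minimal sketch; the parameter values mirror the setup described above, but the exact calls in our pipeline may differ:

```python
import nlpaug.augmenter.word as naw

# BERT-based contextual substitution (replacement rate p = 0.1)
bert_aug = naw.ContextualWordEmbsAug(
    model_path="bert-base-uncased", action="substitute", aug_p=0.1)

# WordNet synonym substitution baseline
syn_aug = naw.SynonymAug(aug_src="wordnet", aug_p=0.1)

caption = "a man riding a bike down the street"
print(bert_aug.augment(caption))
print(syn_aug.augment(caption))
```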

Results
To compare the results, we used BLEU [44], METEOR [45], ROUGE-L [46], CIDEr [47], and SPICE [48], which are widely used for comparing image captioning models. BLEU is a metric widely used for the machine translation task; it uses n-gram precision to calculate a similarity score between reference and generated sentences. METEOR is also based on n-grams, but it uses synonym matching along with exact word matching. ROUGE-L is based on longest common subsequence statistics: the longest common subsequence naturally takes sentence-level structural similarity into account and automatically identifies the longest co-occurring in-sequence n-grams. We especially focused on the CIDEr and SPICE metrics, as they are human consensus metrics. CIDEr measures the similarity between the captions generated by the model and the captions created by a person using n-grams with TF-IDF weighting of each of them. SPICE is based on semantic graph parsing and is used to determine how well the model has captured the attributes of objects and the relationships between them.
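For reference, here is a minimal sketch of computing CIDEr scores with the commonly used pycocoevalcap package; the toy captions and identifiers are illustrative, and real usage would feed tokenized model outputs and references:

```python
from pycocoevalcap.cider.cider import Cider

# references: several human captions per image; candidate: one generated caption
gts = {"img1": ["a man riding a bike down the street",
                "a person cycles along a city road"]}
res = {"img1": ["a man rides a bicycle on the street"]}

score, per_image = Cider().compute_score(gts, res)
print("CIDEr:", score)  # corpus-level score; per_image holds per-image values
```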
It is important to note that all of the models trained from their open source code show slightly worse results than those reported in the corresponding papers.
Figure 1 compares augmentation using BERT with d = 2 for different values of the replacement rate p. With a large value of p equal to 0.5, the model performs worse than with the smaller value of 0.1. This may be because too much information from the original caption is lost, so the sentences become less grammatically correct and less human-like. On the other hand, the model with a value of 0.1 performs better than all others: such small changes add some variety to the dataset without damaging meaning or grammatical structure.

The comparison of models trained using the various augmentation techniques (synonyms and BERT) with a model trained on the original dataset is shown in Figure 2. The model trained on the dataset with synonym augmentation is slightly better than the original model. The model trained on the dataset augmented with BERT shows significantly better results than both the original model and the model trained on the synonym-augmented dataset.

Models trained on BERT-augmented datasets with d = 2 and d = 3 and p = 0.1 (as the more promising replacement rate) are compared in Figure 3. The comparison shows that training on the more heavily augmented dataset does not increase model performance: with a three-times increase in dataset size, the model trains worse than with a two-times increase. This indicates some boundaries of the proposed method, namely that quality does not increase when the dataset is enlarged more than two times.

Table 1 summarizes the final test scores for all of the trained [40] models. The model trained on the two-times increased dataset obtained using BERT augmentation with p = 0.1 shows the best results in almost all of the metrics, significantly exceeding the model trained on the original dataset by 2.7 points for CIDEr and 0.2 points for SPICE. This demonstrates the applicability of the proposed augmentation method for improving the quality of models designed for the image captioning task. Augmentation can be widely used to increase the quality of existing state-of-the-art approaches without any changes to those models.

Table 2 summarizes the results for all three models trained with BERT augmentation with d = 2 and p = 0.1. All of the models trained with augmentation show better results than the corresponding models trained without augmentation. This confirms our conclusions about the benefits of the proposed augmentation for training state-of-the-art image captioning models.

Qualitative Analysis
We selected several examples of captions generated by the resulting models on the test data; they are presented in Figure 4. Here, "Ground truth" denotes a real human-created caption from the test set used in quality measurement. "Original" denotes a caption generated by the model trained on the original dataset without augmentation. We can see that augmentation helps the models trained on augmented datasets to construct more elegant and rich sentences than the model trained on the original dataset. Additionally, some examples of augmentation of the original captions are presented in Figure 5.
Here, we can see that both augmentation methods, using synonyms and BERT, can diversify the captions, giving the model the potential to learn more complex and general ideas about the textual description of an image. At the same time, since augmentation does not take the content of the image itself into account, the augmented captions sometimes do not reflect the essence of the image well enough. This simulates the noise that may be present in descriptions created by humans. Although the augmentation is not perfect, in general the captions are similar to the ground truth ones.

Conclusions
In this work, we proposed the use of augmentation of image captions in a dataset using synonyms and contextualized word embeddings. Comparison of the results achieved by models trained on augmented datasets based on the MSCOCO dataset showed that the proposed augmentation methods improve the quality of models solving the image captioning problem. It was also shown that augmentation with contextualized word embeddings helps more than synonym replacement. In addition, enlarging the dataset more than two times via augmentation does not improve the results of the models, which may indicate the limits of the proposed augmentation methods.
Despite the good results, it is worth noting that the captions generated by the proposed augmentation methods cannot completely replace human ones. In addition, we augment captions only at the word level, which limits the structure of augmented sentences (in particular, the number of words). A given word may not have many alternatives that can be used instead of it in a particular context. This can also make it difficult to automatically generate new captions for a training dataset.
In future work, other augmentation methods that work at the sentence level can be explored, or text paraphrasing methods can be used for the same purpose. Additionally, the applicability of the proposed methods to the visual question answering task can be studied.
Author Contributions: Conceptualization, methodology, software, writing-original draft preparation, visualization, investigation, editing V.A.; writing-review, supervision, project administration, funding acquisition D.Š. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by Google Cloud Platform Education Programs Grant.

Conflicts of Interest:
The authors declare no conflict of interest.