Image-Captioning Model Compression

Image captioning is an important task at the intersection of natural language processing (NLP) and computer vision (CV). The current quality of captioning models allows them to be used for practical tasks, but they require both large computational power and considerable storage space. Despite the practical importance of the image-captioning problem, only a few papers have investigated model compression in order to prepare such models for use on mobile devices. Furthermore, these works usually investigate only decoder compression in a typical encoder–decoder architecture, while the encoder traditionally occupies most of the space. We applied the most efficient model-compression techniques, such as architectural changes, pruning and quantization, to several state-of-the-art image-captioning architectures. As a result, all of these models were compressed by no less than 91% in terms of memory (including the encoder), but lost no more than 2% and 4.5% in metrics such as CIDEr and SPICE, respectively. At the same time, the best model achieved 127.4 CIDEr and 21.4 SPICE at a size of only 34.8 MB, which sets a strong baseline for compression of image-captioning models and could be used in practical applications.


Introduction
One of the most significant tasks combining two different domains, CV and NLP, is the image-captioning task [1]. Its goal is to automatically generate a caption describing an image given as input. The description should contain not only a listing of the objects in the image, but should also take into account their attributes, the interactions between them, etc., so that the description is as humanlike as possible. This is an important task for many practical applications in everyday life, such as human-computer interaction, assistance for visually impaired people and image search [2,3].
Typically, image-captioning models are based on an encoder-decoder architecture. Some models, such as [4][5][6][7][8], use a standard convolutional neural network (CNN) as an encoder. However, the use of image detectors for this purpose, as in [9,10], has been rising in popularity in recent years. The decoder is a text generator, most often represented as a recurrent neural network (RNN) [11] or a long short-term memory network (LSTM) [12] with attention [13]. However, more complex models based on the transformer architecture [14], which is the state of the art in a variety of NLP problems, have been created, using transformers both for sentences [15][16][17][18] and images [19].
Unsurprisingly, rising interest in image-captioning models has significantly improved their quality in a very short time. At the same time, improvements occur not only due to more complex neural network architectures, but also due to an increase in the number of parameters and, accordingly, the size of the captioning model. Moreover, it is not known for certain whether a more successful architecture and training method, or simply a larger number of parameters, leads to the increase in quality. Meanwhile, deep neural network models are increasingly used on mobile devices, which have limitations both in computing power and in storage size. As a result, current state-of-the-art models, such as [9,10], cannot be used on mobile devices, as they usually weigh 500 MB or more. The problem of model compression for image captioning is very poorly studied. Even the few published papers [20][21][22] usually investigate only decoder compression, although the encoder is an integral part of the model when applying it to new data.
At the same time, model-compression techniques in related areas of CV and NLP are well studied. For example, research on various architectures for object detection in images (which is often used as an encoder for the image-captioning task) [23][24][25] has been conducted for a long time, with the aim of significantly reducing their size and increasing their operating speed without losing quality (or even while increasing it). On the other hand, methods of compressing NLP models have been investigated for various tasks, ranging from text classification [26] to machine translation [27], which is similar in nature to the image-captioning problem.
In this article, we decided to fill this gap and undertake a comprehensive application and analysis of deep neural network (DNN) model-compression techniques [28] for image-captioning models. We investigated various encoder architectures specifically tailored to the task of image detection in a low-resource environment. We also applied compression methods to the decoder, reducing the architecture of the model itself as well as applying various pruning [29] and quantization [30] methods. As a result, we were able to significantly reduce the occupied space of two state-of-the-art models, Up-Down [9] and AoANet [10], while not losing much in terms of important metrics. For example, using our approach, we were able to reduce the size of the classic Up-Down model from 661.4 MB to 58.5 MB (including the encoder), that is, by 91.2%, while the main image-captioning metrics, CIDEr [31] and SPICE [32], decreased by only 1.7 and 0.6 points, respectively.
The main contributions of this paper are as follows:
• We have proposed the use of modern model-compression methods for the image-captioning task, both for the encoder and for the decoder, in order to reduce the overall size of the model;
• We compared different options for such a reduction, such as different encoder models, different decoder architecture variations and different pruning options, and conducted a study of the effect of quantization;
• The proposed methods allowed us to significantly reduce the size of the models without significant loss of quality;
• The methods worked universally on two different models, which suggests that they can be successfully applied to other image-captioning architectures.
The paper is organized as follows:
• In Section 2, we review papers related to this work, covering topics such as image-captioning tasks in general, neural network model-compression techniques and their application to image-captioning tasks;
• In Section 3, we describe our methodology, first describing encoder and then decoder compression approaches;
• In Section 4, we report the design of our experiments as well as their numerical results and analysis;
• In Section 5, we state the main achievements and discuss the results;
• In Section 6, we conclude the work and name future research directions.

Image Captioning
The task of automatically generating a caption describing an image is important for smoother human-machine interaction. One of the earliest successful methods of such generation, which laid the foundation for modern image-captioning methods, is [4]. The architecture is a typical encoder-decoder architecture, which first generates some representation of the image, and then, based on this representation, generates text, most often word by word. This scheme is followed by most modern image-captioning architectures.
Two key works that significantly improved the performance of image-captioning models are [8,9]. The first suggested using the REINFORCE algorithm to directly optimize the discrete quality metric CIDEr, thereby directly improving its value. The second suggested using an object detector as the encoder (the article itself uses Faster R-CNN [24]) and then generating a caption based on the detected objects and their attributes, instead of trying to compress the entire image into one vector and generating from that representation.
In addition, starting with [5], the attention mechanism has been actively used for the image-captioning task; it allows the model to adaptively increase attention to the objects or areas that are most important for generating a text description of the current image. It is used by high-quality works such as [10,33,34] and others.
Recently, transformers [14] have been gaining more and more popularity, becoming state-of-the-art models for many tasks from the field of NLP. Their use improves the quality of image-captioning models as well, which is investigated in [15,16,18].
Also, research in the field of image captioning is moving towards the unification of models with other vision-language understanding and generation tasks. Works such as [35,36] use networks pretrained on a giant dataset and obtain a model capable of solving a number of vision-language problems. However, due to the fact that such pretraining requires huge computational resources, it is difficult for researchers to find improvements to such models and broadly study them.

Effective Architectures
One of the most effective ways to reduce model size is to look for a more efficient architecture that contains fewer parameters but still performs well. The search for such architectures is especially active for tasks that, on the one hand, are very useful to solve in real time on mobile devices, and, on the other hand, typically have rather cumbersome architectures. One of these is the task of detecting an object in an image, together with its classification and the determination of its attributes. For example, Faster R-CNN [24] uses a Region Proposal Network (RPN) and combines it with Fast R-CNN [23], using a convolutional neural network (VGG-16 [37]) as a backbone. SSD [38] initially creates a set of default boxes over different aspect ratios and scales, and then determines whether they contain objects of interest. RetinaNet [39] is a small, dense detector that performs well thanks to training with the focal loss.
As a rule, a backbone in the form of a convolutional neural network is one of the important components of detectors. Therefore, the task of selecting an efficient convolutional network architecture is also on the agenda. Over time, various ideas have appeared to reduce the number of parameters and, accordingly, the space occupied by the model. One of the most popular, MobileNetV3 [40], was found by a neural architecture search based on previous MobileNet architecture versions. Another widely used convolutional neural network architecture is EfficientNet [41], which resulted from a dedicated study of how the number of parameters influences model quality. This network is also used in EfficientDet [25], which generalizes these ideas by transferring them to the object-detection domain.

Pruning
Pruning is one of the most popular and effective methods for reducing the size of models [42][43][44]. Its idea is to remove model parameters that are not useful or meaningful. Deleting such parameters does not cause much damage to the final quality, but at the same time it saves storage space and computing resources.
In general, all pruning methods can be divided into two categories: structured pruning [45,46] and unstructured pruning. Within structured pruning, entire rows, columns or channels of layers are removed. Such techniques are used both for reducing CNN models and for models using RNN.
Another category is unstructured pruning [47,48]. In this kind of method, the removal of weights is not tied to a specific structure of the model; any weights can be removed depending on a criterion of their "importance". Such methods are actively used in different variations for various tasks. For our initial research, we focused on relatively simple methods that can achieve a good result.

Quantization
Another method of compressing a model without losing quality is quantization [49]. The essence of quantization is to use lower precision numbers for storing the model in order to reduce the amount of occupied space, and at the same time, without losing quality. Indeed, most practical applications do not require the precision that standard floating point data types provide.
Quantization approaches can be divided into two groups based on how the clipping range is defined: static and dynamic quantization. In static quantization [50,51], the clipping range is calculated before inference and remains the same during the model's runtime. In the other group of methods, called dynamic quantization [52,53], the clipping range is calculated dynamically for each activation map during model application.
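To make the distinction concrete, the following sketch (our illustration, using a simplified symmetric 8-bit scheme rather than any particular library's implementation) contrasts a clipping range fixed in advance from calibration data with one recomputed per input:

```python
# Sketch contrasting static and dynamic clipping ranges for activation
# quantization. The symmetric int8 scheme here is an illustrative choice.

LEVELS = 127  # symmetric int8: values mapped into [-127, 127]

def clip_range(tensor):
    return max(abs(v) for v in tensor)

def quantize(tensor, rng):
    scale = rng / LEVELS
    return [max(-LEVELS, min(LEVELS, round(v / scale))) for v in tensor]

# Static quantization: range fixed before inference from calibration data.
calibration_batches = [[0.1, -2.0, 0.5], [1.5, -0.3, 0.2]]
static_range = max(clip_range(b) for b in calibration_batches)  # 2.0

# Dynamic quantization: range recomputed for each activation map at runtime.
activations = [4.0, -0.5, 1.0]  # falls outside the calibration range
static_q = quantize(activations, static_range)          # 4.0 is clipped
dynamic_q = quantize(activations, clip_range(activations))  # no clipping
```

With the static range, the out-of-range activation 4.0 saturates at 127 and loses information, while the dynamic range adapts to the input at the cost of extra runtime computation.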

Model Size Reducing for Image Captioning
The area of model compression for image captioning is underexplored. Only a few works offer their own approaches to reducing the size of image-captioning models. In [54], the authors propose using SqueezeNet [55] as an encoder and LightRNN [56] as a decoder. The authors of [57] propose a way to reduce model size by tackling the problem of huge embedding matrices, which do not scale to bigger vocabularies. In [20,21], novel pruning approaches specifically for the decoder of image-captioning models are proposed.
However, all of these works investigate narrow aspects, for example, only reducing the size of the encoder or decoder or using only one method of compression. In contrast, our article is intended to provide a comprehensive understanding of the methods for compressing models for image captioning, using various combinations of approaches in order to reduce the final model size as much as possible without much loss of quality.

Methodology
This section describes the methods we used to reduce the size of the models. Both the methods shared by the two investigated architectures (Up-Down and AoANet) and the methods specific to each particular model will be described. It is important to note that we did not aim to compare the Up-Down and AoANet models with each other, but rather to compare compression methods across several models. A comparison of the original models can be found in [10].

Encoder Compression
As mentioned above, the overwhelming majority of state-of-the-art image-captioning architectures use an image object detector as an encoder. That is, typical models are built according to the following scheme: let I be the input image, E be a detection model, and D be a decoder. First, the method generates k image features V = E(I) such that each image feature encodes a region of the image. Then, the decoder generates the caption D(V) based on these features.
We propose to use this general scheme as a template; however, as the encoder, we take models that are more suitable for our task. One option is to keep the Faster R-CNN detection model but with a different backbone (since it is the backbone that contains most of the parameters). As such a backbone, we propose MobileNetV3, which performs well in memory-restricted settings. For comparison, we also take a detector that was originally designed for a small memory footprint while maintaining good quality; EfficientDet was chosen as such a detector.
Thus, as E, we use the Faster R-CNN ResNet101 encoder from the original Up-Down model, as well as Faster R-CNN MobileNetV3 and EfficientDet. Because the sizes of these smaller detectors are already modest, we did not consider other methods of encoder compression.

Decoder Compression
Assuming that current models contain more parameters than necessary to achieve the same quality, we first investigate whether the number of parameters can be reduced without changing the nature of the architecture. In this case, the specific way of changing the number of parameters depends on the model under study. Pruning methods are then applied to the best architecture, and the best pruned model is finally quantized.

Architecture Changes
Following [9], the decoder model has three important logical parts:
• Embeddings Calculation. In this part, one-hot encoded vectors representing words are transformed into word vectors. Let Π be the one-hot encoding of the input word and W_e ∈ R^(E×|Σ|) be a word-embedding matrix for a vocabulary Σ. Then, the word vector π can be obtained by the following equation: π = W_e Π. Here, the size of the parameter matrix depends on the embedding dimension E and the size of the vocabulary |Σ|. As the size of the vocabulary is fixed by the preprocessing of the dataset, we experimented with changing E and explored its influence on the overall model performance as well as on the model size.
• Top-Down Attention LSTM and Language LSTM. As these two modules are similar to each other in terms of number of parameters and inner representations, we treat them as one logical part. Both modules are LSTMs, so in general their operation over a single time step is h_t = LSTM(x_t, h_{t−1}), where x_t is the input vector and h_t ∈ R^M is the hidden state. In this case, the sizes of the LSTM parameter matrices strongly depend on the parameter M. We manipulated this parameter, trying to reduce it without causing large losses in the model's quality.

• Attention Module
The third important module of the Up-Down architecture is the Attention Module. It is used to calculate attention weights α_t for the encoder's features v_i. It works in the following way: a_{t,i} = w_a^T tanh(W_va v_i + W_ha h_t), α_t = softmax(a_t), where W_va ∈ R^(H×V), W_ha ∈ R^(H×M) and w_a ∈ R^H. The main constants that influence the number of weights in this module are H, V and M. V is the size of the encoder vectors, which is fixed by the encoder choice; M has been discussed previously; and H is the dimension of the attention representation. We investigated how changing H influenced both the model size and its performance, along with the other parameters.
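As a rough illustration of this attention computation, the following pure-Python sketch uses toy dimensions and made-up weights (all numeric values here are our assumptions for illustration, not the paper's actual H, V and M):

```python
import math

# Toy sketch of the Up-Down attention module: score each encoder feature
# v_i against the hidden state h_t, then normalize the scores with softmax.

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(features, h_t, W_va, W_ha, w_a):
    # a_{t,i} = w_a^T tanh(W_va v_i + W_ha h_t); alpha_t = softmax(a_t)
    proj_h = matvec(W_ha, h_t)
    scores = []
    for v in features:
        proj_v = matvec(W_va, v)
        hidden = [math.tanh(a + b) for a, b in zip(proj_v, proj_h)]
        scores.append(sum(w * h for w, h in zip(w_a, hidden)))
    return softmax(scores)

# Tiny example: k=3 features of size V=2, hidden size M=2, attention dim H=2.
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h_t = [0.2, -0.1]
W_va = [[0.3, -0.2], [0.1, 0.4]]
W_ha = [[0.05, 0.0], [0.0, 0.05]]
w_a = [1.0, -1.0]

alpha = attention_weights(features, h_t, W_va, W_ha, w_a)  # sums to 1
```

The parameter matrices W_va and W_ha have H×V and H×M entries, which is why shrinking H (along with M) directly reduces the module's size.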
Thus, in trying to reach a smaller model size without harming its performance metrics, we manipulated the model parameters E, M and H.
The AoANet model described in [10], similarly to Up-Down, has both embedding-calculation and LSTM parts, so the reasoning for the parameters E and M is the same as above. However, the other parts of the architecture are different, so we do not experiment with them in this paper.
We reduce the parameters E, M and H for Up-Down (and E and M for AoANet) from the values used in the original papers by multiplying them by a common scale factor γ.
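The scaling itself can be sketched as follows; the baseline dimension values below are illustrative placeholders, not the exact figures from the original papers:

```python
# Scale decoder hyperparameters by a common factor gamma.
# Baseline dimensions are hypothetical, for illustration only.

def scale_dims(dims, gamma):
    return {name: int(round(value * gamma)) for name, value in dims.items()}

updown_dims = {"E": 1024, "M": 1024, "H": 512}   # hypothetical baselines
aoanet_dims = {"E": 1024, "M": 1024}

half_updown = scale_dims(updown_dims, 0.5)
# LSTM parameter counts grow roughly quadratically in M, so gamma = 0.5
# cuts those matrices to about a quarter of their size, consistent with
# the roughly threefold decoder shrinkage observed in our experiments.
```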

Decoder Pruning
Because most of the layers of the decoders we considered are linear or LSTM layers, the most suitable pruning method is unstructured pruning. We carry it out after training, with the model weights fixed.
There are two parts of the decoder: calculating embeddings and processing them. The embedding part is a separate, important part, because it is responsible for the selection of vectors that will represent the words. In order to determine the effect of pruning of embeddings, as the main semantic part, on the quality of a model, we consider two options: pruning of the entire model and pruning of everything except embeddings.
Let the model to be pruned be M(W_1, W_2), where W_1 ∈ R^(m_1) is a vector of model parameters to which the pruning algorithm is not applied, and W_2 ∈ R^(m_2) is a vector of model parameters to which the pruning algorithm is applied, sorted by increasing l_1 norm. Let α ∈ [0, 1] be the pruning coefficient and A(W, α) be the pruning algorithm, which generates a binary mask that zeroes out the pruned weights. Thus, the final model after pruning is M(W_1, W_2 · A(W_2, α)), where · denotes elementwise multiplication.
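The formulation above, with A chosen as magnitude-based (l_1) pruning, can be sketched as follows (the function names are ours, not from the paper's code):

```python
# Sketch of unstructured magnitude pruning: zero out the fraction alpha of
# weights with the smallest absolute values.

def l1_prune_mask(weights, alpha):
    """Return a 0/1 mask A(W, alpha) keeping the (1 - alpha) largest-|w| weights."""
    k = int(alpha * len(weights))  # number of weights to remove
    # Indices sorted by increasing |w|; the first k are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1.0] * len(weights)
    for i in order[:k]:
        mask[i] = 0.0
    return mask

def apply_mask(weights, mask):
    # Elementwise product W_2 * A(W_2, alpha) from the formulation above.
    return [w * m for w, m in zip(weights, mask)]

w2 = [0.5, -0.01, 0.3, 0.002, -0.8]
pruned = apply_mask(w2, l1_prune_mask(w2, alpha=0.4))
# The two smallest-magnitude weights (-0.01 and 0.002) become zero.
```

Random pruning would differ only in how the indices to zero are chosen (uniformly at random instead of by magnitude).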
We concentrate on two methods of choosing the weights to prune:
• l_1 pruning, which removes the fraction α of the weights with the smallest l_1 norms (magnitudes);
• random pruning, which removes a randomly chosen fraction α of the weights and serves as a baseline.

Decoder Quantization
Having fixed the model, we use post-training dynamic quantization. When converting from floating-point numbers to integers, each number is multiplied by some constant and the result is rounded so that it fits into an integer type. This constant can be defined in different ways. The essence of dynamic quantization is that, although the weights of the model are quantized before applying it and the model is kept compressed, the constant by which activations are multiplied is calculated during inference, depending on the input data. This approach allows us to maintain maximum accuracy while storing the model in a compressed form. This method of quantization was chosen because we are primarily interested in the accuracy and size of the model, not the speed of its application.
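A minimal sketch of this idea, assuming a simplified symmetric int8 scheme (real frameworks such as PyTorch provide post-training dynamic quantization utilities that implement this for whole models):

```python
# Simplified sketch of dynamic quantization: weights are converted to int8
# once and stored compressed, while the activation scale is recomputed per
# input at runtime. The symmetric scheme is an illustrative simplification.

def dynamic_scale(values):
    # Scale derived from the current tensor, not a precomputed range.
    return max(abs(v) for v in values) / 127 or 1.0

def quantize(values, scale):
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(qvalues, scale):
    return [q * scale for q in qvalues]

weights = [0.52, -1.27, 0.004, 0.9]
w_scale = dynamic_scale(weights)          # fixed once; weights stay int8
q_weights = quantize(weights, w_scale)

activations = [3.1, -0.2, 0.7]            # depend on the input image
a_scale = dynamic_scale(activations)      # recomputed at inference time
q_activations = quantize(activations, a_scale)

roundtrip = dequantize(q_weights, w_scale)
max_err = max(abs(a - b) for a, b in zip(weights, roundtrip))
# max_err is bounded by half the quantization step, w_scale / 2.
```

Storing each weight as one byte instead of a 4-byte float is what yields the roughly fourfold reduction in the size of the quantized layers.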

Experimental Setup
To compare the effectiveness of different model-compression methods, we used MSCOCO [58], the most common image-captioning benchmark dataset. It consists of 82,783 training images and 40,504 validation images, with five different captions for each image. We used the standard Karpathy split from [7] for offline evaluation. As a result, the final dataset consists of 113,287 images for training, 5000 images for validation and 5000 images for testing.
We also preprocessed the dataset by replacing all words that occurred fewer than five times with a special token <UNK>. Further, following commonly used procedures, we truncated captions to a maximal length of 16 words.
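This preprocessing can be sketched as follows; the token, threshold and caption length come from the text above, while the helper names and the toy corpus are ours:

```python
from collections import Counter

# Sketch of the caption preprocessing described above: words occurring
# fewer than 5 times become <UNK>, and captions are cut to 16 words.

MIN_COUNT = 5
MAX_LEN = 16

def build_vocab(captions):
    counts = Counter(word for cap in captions for word in cap.split())
    return {w for w, c in counts.items() if c >= MIN_COUNT}

def preprocess(caption, vocab):
    words = caption.split()[:MAX_LEN]
    return [w if w in vocab else "<UNK>" for w in words]

# Tiny illustrative corpus.
corpus = ["a cat sits on a mat"] * 5 + ["a rare pangolin appears"]
vocab = build_vocab(corpus)

processed = preprocess("a pangolin sits on a mat", vocab)
# -> ["a", "<UNK>", "sits", "on", "a", "mat"]
```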
The metrics BLEU [59], METEOR [60], ROUGE-L [61], CIDEr [31] and SPICE [32] were used to compare the results. BLEU came from the machine translation task; it uses n-gram precision to calculate a similarity score between reference and generated sentences. METEOR uses synonym matching along with n-gram matching. ROUGE-L is based on longest-common-subsequence statistics. The two most important metrics for image captioning are CIDEr and SPICE, as they are human-consensus metrics. CIDEr is based on n-grams and applies TF-IDF weighting to each of them. SPICE uses semantic graph parsing to determine how well the model has captured the attributes of objects and their relationships.
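As a toy illustration of the modified n-gram precision at the core of BLEU (full BLEU additionally combines several n-gram orders and applies a brevity penalty, which this sketch omits):

```python
from collections import Counter

# Toy modified n-gram precision, the core component of the BLEU metric.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n):
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    if not cand:
        return 0.0
    # Clip candidate counts by reference counts ("modified" precision),
    # so repeating a matching word cannot inflate the score.
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / sum(cand.values())

ref = "a cat sits on the mat"
p1 = ngram_precision("a cat sits on the mat", ref, 1)  # 1.0, exact match
p2 = ngram_precision("the cat sat on the mat", ref, 2)  # 2 of 5 bigrams match
```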
For our experiments, we took models based on the public repository https://github.com/ruotianluo/ImageCaptioning.pytorch (accessed on 4 November 2021). The implementation framework was PyTorch. For the encoders, the built-in PyTorch Faster R-CNN was used, as well as the https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch (accessed on 4 November 2021) repository for EfficientDet. All models were trained for 30 epochs in the usual way, and then for 10 more epochs in a self-critical style following [8]. The models were trained and tested on the Google Cloud Platform using servers with eight CPU cores, 32 GB of RAM and one Tesla V100 GPU accelerator.

Encoder Compression
In Table 1, we present a comparison of the Up-Down and AoANet models modified to use different encoders. The variants with the Faster R-CNN ResNet101 encoder refer to the same models as in the original papers. It can be observed that all of the proposed variations, both for the Up-Down and AoANet models, have quality metrics very close to each other. This aligns well with the fact that, despite their small size, the image-detection models used as encoders in these experiments show great results in the standalone object-detection task.
Furthermore, the use of smaller encoders designed to be launched on mobile devices helps to reduce the encoder's size from 468.4 MB to 15.1 MB (a 96.8% reduction), losing only 0.6 CIDEr and 0.3 SPICE points. We chose the EfficientDet-D0 encoder as the most suitable for this task among all those tested, and we use it in all further experiments.

Table 1. Evaluation results of all tested models with different encoders. Both the "Size" and "#Params" columns refer to the encoder.

Decoder Architecture Changes
In Table 2, we evaluate models with their decoder reduced using different scale factors. A value of γ equal to 1 corresponds to the original models, but, as said before, with the EfficientDet-D0 encoder. For the Up-Down model as well as for AoANet, decoder reduction with a scale factor of 0.5 shows results comparable to the model without reduction, but with approximately three times less memory and fewer parameters. Reducing the model further leads to greater loss in quality. For example, the Up-Down model's CIDEr metric decreases by only 0.3 points when a scale factor of 0.5 is used, but by 4.9 points when a scale factor of 0.25 is used, which is a much bigger loss. A similar CIDEr drop is seen for the AoANet model. Thus, the scale factor γ = 0.5 acts as a compromise that both reduces the model size and keeps the quality metrics at the same level. We use decoders reduced with a scale factor of 0.5 in all experiments for the rest of the paper.

Decoder Pruning
The results of pruning using different methods and pruning coefficients are presented in Figures 1 and 2. The blue and orange lines in both figures correspond to l_1 pruning, while the red and green lines correspond to random pruning. "True" indicates that the embedding layer was pruned and "False" that it was not. NNZ means "number of non-zero parameters", which is a common measure for comparing pruning techniques.
As can be seen, random pruning is strictly worse than l_1 pruning. This could be due to the fact that removing the least valuable weights of the model expectedly influences the quality metrics less than removing random parameters.
The other observation is that although not pruning the embedding layer helps to maintain greater quality at large pruning coefficients for the Up-Down model, it makes almost no difference for the AoANet model. Additionally, a sharp decline in metrics can be observed for the Up-Down model for pruning coefficients above α = 0.1, and for AoANet above α = 0.5. Taking this into account, we are in any case not interested in the region of Up-Down pruning where the choice of pruning or not pruning the embeddings could play any role.
As said before, the biggest pruning coefficients (which lead to greater model compression) at which the models still show good quality are 0.1 and 0.5 for Up-Down and AoANet, respectively. Increasing these values leads to a drop in metrics. More detailed comparisons of the model results near this drop, along with the unpruned model quality, are reported in Table 3. The table confirms the observations obtained from the figures. The chosen boundary values of the pruning coefficients help to reduce the decoder size by 10% while losing only 0.6 CIDEr points for the Up-Down model, and by 50% while losing only 1.5 CIDEr points for the AoANet model. Such a great reduction for AoANet could indicate that the overall model is bigger and has more parameters than needed for its architecture to show comparable results. Furthermore, this model's weights could be quite sparse, with many of them close to 0; in this case, such extensive pruning would not lead to a big loss in performance.

Table 3. Evaluation results of all tested models' decoders pruned using different pruning coefficients. Both the "Size" and "NNZ" columns refer to the decoder.

Figure 2. CIDEr and SPICE validation scores depending on the pruning coefficient α for different pruning methods applied to the AoANet model. "L1" indicates that l_1 pruning was used, while "Random" indicates random pruning. "True" indicates pruning of the embedding layer and "False" indicates no pruning.

Decoder Quantization
For the quantization experiments, the best models obtained in the previous section were used: the Up-Down model pruned with α = 0.1 and the AoANet model pruned with α = 0.5. The results of the quantization experiments can be found in Table 4. Quantization helps to reduce model size with almost no loss of quality. The resulting decoder size of the Up-Down model is 43.4 MB, and that of the AoANet model is 19.7 MB.

Discussion
The final models compressed using all of the proposed methods are compared with the original ones in Table 5. It can be observed that our methods helped to significantly reduce model size with a comparatively small change in performance metrics. The Up-Down model's size was reduced from 661.4 MB to 58.5 MB, with a loss of 1.7 points in CIDEr and 0.6 points in SPICE. The AoANet model's size was reduced from 791.8 MB to 34.8 MB, with a loss of 2.4 points in CIDEr and 1 point in SPICE. The proposed methods worked well for both models, which is a good indicator of their generalizability. Sizes of 58.5 MB and 34.8 MB are already small enough for use on mobile devices in real-world applications.

Table 5. Evaluation results of all tested models, both original and compressed using the methods from this paper. Both the "Size" and "NNZ" columns refer to the whole model.

Conclusions
In this work, we proposed the use of different neural network compression techniques for models trained to solve image-captioning tasks. Our extensive experiments showed that all of the proposed methods help to reduce model size with an insignificant loss in quality. Thus, the best obtained model reaches 127.4 CIDEr and 21.4 SPICE while weighing only 34.8 MB. This sets a strong baseline for future studies on this topic.
In future work, other compression methods, such as knowledge distillation, could be investigated in order to reduce the models' sizes even further. More complex pruning or quantization techniques could also be used. Another direction for further research is encoder compression using not only different architectures but also other methods, for example, pruning and quantization. Reducing the size of more complex and bigger models such as [35,36] is also a promising direction of study.
Author Contributions: Conceptualization, methodology, software, writing-original draft preparation, visualization, investigation, editing V.A.; writing-review, supervision, project administration, funding acquisition D.Š. All authors have read and agreed to the published version of the manuscript.