Towards Mapping Images to Text Using Deep-Learning Architectures

Abstract: Images and text represent types of content that are used together for conveying a message. The process of mapping images to text can provide very useful information and can be included in many applications, e.g., in the medical domain, in applications for blind people, or in social networking. In this paper, we investigate an approach for mapping images to text using a Kernel Ridge Regression model. We considered two types of features: simple RGB pixel-value features and image features extracted with deep-learning approaches. We investigated several neural network architectures for image feature extraction: VGG16, Inception V3, ResNet50, Xception. The experimental evaluation was performed on three data sets from different domains. The texts associated with images represent objective descriptions for two of the three data sets and subjective descriptions for the other data set. The experimental results show that the more complex deep-learning approaches that were used for feature extraction perform better than simple RGB pixel-value approaches. Moreover, the ResNet50 network architecture performs best in comparison to the other three deep network architectures considered for extracting image features. The model error obtained using the ResNet50 network is approximately 0.30 lower than the error obtained with the other neural network architectures. We extracted natural language descriptors of images and we made a comparison between original and generated descriptive words. Furthermore, we investigated whether there is a difference in performance depending on the type of text associated with the images: subjective or objective. The proposed model generated descriptions that were more similar to the original ones for the data set containing objective descriptions, whose vocabulary is simpler, larger and clearer.


Introduction
A quick look at an image is sufficient for a human to say a few words related to that image. However, this very easy task for humans is a very difficult task for existing computer vision systems. The majority of previous work in computer vision [1][2][3][4] has focused on labeling images with a fixed set of visual categories. Even though closed vocabularies of visual concepts are a convenient modeling assumption, they are quite restrictive when compared to the vast amount of rich descriptions and impressions that a human can compose.
Some approaches that address the challenge of generating image descriptions have been proposed [5,6]. In this work, we want to take a step forward towards the goal of generating descriptions of images that are close to natural language. Figure 1 gives a hint to the motivation of our work by showing several samples that were used in the experimental evaluation. Each sample consists of an image and the text associated with it. We chose three data sets from different domains. The first data set belongs to the social network domain. The text associated with each image is a subjective description or impression of that image written by a user. The second data set belongs to the medical domain. The text associated with each image is an objective description written by a radiologist. The third data set belongs to the gaming domain and it was formed using a game in which users must label images. The text associated with each image was written by a user and represents descriptive words of images. This paper is an extension of our preliminary work which was presented at the Recent Advances in Natural Language Processing conference in 2019 [7]. The added contributions of the work described here compared to our preliminary work presented in [7] are:
• Investigating several deep-learning architectures. Based on the preliminary investigations presented in [7], we concluded that the more complex deep-learning approaches are better than the simple RGB pixel values for the feature extraction tasks. For this reason, we further investigated several neural network architectures and we compared the model performance by the complexity of the neural network that was used as feature extractor.
• Experimental evaluation on multiple data sets. We extended the experimental evaluation by using data sets from different domains, i.e., social media, medical data and gaming.
• Qualitative analysis. In addition to the quantitative comparisons which were presented in the preliminary work, we further extended the analysis by also including a qualitative analysis. We compared visually the generated description with the original description of an image.
• Objective vs subjective descriptions. In addition to the subjective descriptions that were analyzed in our previous work, we compared generated descriptions with the original description of an image in the context of the text type (subjective vs. objective).
• Language comparison. We investigated whether the language of the text associated with images influences the performance.
Our core insight is that we can map images to natural text by leveraging the image-text data set in a supervised learning approach in which the image represents the input and the text represents the output. We employed a Kernel Ridge Regression model for the task of mapping images to text. We used two types of features: image and text features. The model generates text which consists of a set of words from a dictionary. We used a bag-of-words model to construct the text features. The image features were extracted with deep-learning approaches in the form of four convolutional neural network (CNN) architectures: 1. VGG16, 2. Inception V3, 3. ResNet50, 4. Xception.
The goal of our work is to compare different types of deep neural network architectures for generating descriptions of images. The four types of deep neural network architectures that we investigated were introduced as the winners of ImageNet challenge (2014-2016) [8]. Starting from the VGG16 network, which is the winner of ImageNet challenge 2014, the networks have brought improvements in: the number of layers, the pooling layers, the activation and the loss function, the regularization and the optimization, and the reduced number of parameters in relation to the number of layers. The main challenge is finding a model that is rich enough to simultaneously reason about the contents of images and their representations in the natural language domain. Additionally, the model should be free of assumptions about specific templates or categories and instead rely on learning from the training data. The model will go beyond the simple objective description of an image and also give the impression that the image could make upon a certain person. An example of this is shown in the bottom-right image of Figure 1A, in which we do not have a caption or description of the animal in the image but the subjective impression that the image makes upon the viewer.
In the experimental evaluation we investigated three data sets from different domains. We designed a system that automatically associates an image to a set of words from a dictionary. Depending on the data set used, these words are not only descriptors of the content of the image, but also subjective impressions and opinions of the image. Two of the three data sets were written in English, while one data set was written in Spanish. In our experiments, we investigated whether the language of the text associated with the images influences the performance of the model.
There is a difference in performance for the four deep network architectures investigated in the experimental evaluation: the network with the deepest architecture has the best results for all three data sets. In particular, the image-to-text mapping achieves the best results when the ResNet50 network architecture is used as image feature extractor. In terms of similarity, if we compare the generated description with the original description of an image, our proposed model performs better for the data sets which have objective descriptions associated with images.
The novelty of our contribution is as follows:
• We designed a system that automatically associates an image to a set of words from a dictionary using a Kernel Ridge Regression model. We showed that Kernel Ridge Regression, which combines ridge regression with the kernel trick, can be used for the problem of image description.
• Based on the experimental evaluation, we confirm the potential of deep-learning techniques for image to text mapping. We considered two types of image features: RGB pixel-value features, and features extracted with four deep-learning approaches. The experimental results show that the features extracted using deep-learning architectures perform better than the RGB pixel-value features. Furthermore, the network whose architecture contains the most hidden layers performs best.
• We investigated the difference between objective and subjective descriptions in three data sets from different domains: social media, medical and gaming domain. We noticed that our model generated words that were more similar to the original ones for objective descriptions. Furthermore, we noticed that the language of the text associated with images influences the performance of the algorithm.
• The proposed method can predict texts that are close to the original text associated with the image. In the experimental evaluation we compared the generated description with the original description of an image: our proposed method performs better for the data sets with objective descriptions.
• We consider that investigating different deep-learning architectures for feature extraction for mapping images to text is the added value of our work. To our knowledge, our approach based on a combination of ridge regression and deep learning has not been investigated for mapping images to text before.

Related Work
Image to text mapping. Image to text mapping can be divided into two categories: image captioning and image description. Several approaches for image captioning and image description tasks have been proposed [1,2,4,5]. State-of-the-art techniques for image captioning and image description tasks are based on recurrent neural networks [1,9].
Image captioning can be defined as an automated objective description of an image. This concept brings together two major areas of research: Computer Vision and Natural Language Processing. Organizing words into a sentence is not an easy task for a computer: image captioning needs a high level of understanding of the semantic content of an image and the ability to express image information in a human-like sentence. State-of-the-art techniques for image captioning tasks are based on recurrent neural networks, which take as input a representation of the characteristics of an image. The research presented in this paper is in the direction of image captioning. Mapping images to text allows us to build dictionaries of words and select from these dictionaries the words which are the most relevant to an image. The learning setting that we investigate in this paper is different from the image captioning setting because our system automatically associates an image to a set of words from a dictionary, these words being not only descriptors of the content of the image, but also subjective opinions of the image. Deep-learning techniques are often used for image captioning tasks [1,2,4]. In [10] the authors proposed a new learning method named Contrastive Learning, which encourages distinctiveness, but at the same time aims to maintain the quality of the generated captions. A reference model was used during the learning process in addition to the image-text pairs. The authors introduced mismatched pairs as input, where the text is the description of another image. In [11] the authors proposed an approach for training an image captioning model in an unsupervised manner. In contrast to our setting, their model requires an image set, a sentence corpus, and an existing visual concept detector.
Image description is more than image captioning. In image captioning tasks, the text associated with an image represents an objective description, while in image description tasks the text associated with an image is a subjective description. Several approaches that address the challenge of generating image descriptions have been proposed [5,6,9,12]. In [9], a phrase-based hierarchical Recurrent Neural Network model was presented, which incorporates the natural language provided by the human expert. We consider that the task of generating an objective description belongs to the image captioning domain, while the task of generating a subjective description belongs to the domain of image description. We used three data sets from different domains for experimental evaluation. Two data sets contain text in the form of objective descriptions and one data set contains subjective descriptions. The text from the latter data set is a subjective description of the image written by a user. In [5], the authors developed a multimodal Recurrent Neural Network architecture that is capable of generating a description for an input image. The model can find visual-semantic connections, even if the image shows a small object. In contrast with the model presented in [5], our approach associates an image to a set of words from a dictionary. The dictionary is formed based on the image descriptions which were provided by the users. In comparison with [5], our proposed model can also generate subjective descriptions.
Multimodal High-level Representation using Deep Learning. Multimodal data refers to the multiple modalities/types of information that are used in a research problem or an experience. Multimodal deep neural networks have been very successful in computer vision and natural language applications [13][14][15]. In [16] the authors investigated multimodal learning using audio and video data. The authors of [14] presented an approach to learn several specialist models with deep-learning techniques, using video and audio information. In our study, we combined two types of information for improving the mapping of images to text: images and associated tags or text explanations. Depending on the type of information used in a study, different types of information representations were developed for feature extraction. For visual information, CNNs are the main approach for high-level image representation [4,17]. We compared four deep neural network architectures for high-level image representation in the context of mapping images to text. For textual features, the authors of [18] proposed a document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches the query and the document using learned distributed representations. In our model, for textual data representation, we used a simplified representation which treats text as a multiset of words.

Kernel Ridge Regression for Mapping Images to Text
In this section, we describe the Kernel Ridge Regression model that we use for mapping images to text.
Let $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$ be the sets of inputs and outputs, respectively, where $n$ represents the number of observations. Let $F_X \in \mathbb{R}^{d_X \times n}$ and $F_Y \in \mathbb{R}^{d_Y \times n}$ denote the input and output feature matrices, where $d_X$ and $d_Y$ represent the dimensions of the input and output features, respectively. The inputs represent the images, and the input features can be either simple RGB pixel values or something more complex, such as features extracted automatically using convolutional neural networks [24]. The outputs represent the texts associated with the images and the output features can be extracted using Word2Vec [25].
A mapping between the inputs and the outputs can be formulated as a multi-linear regression problem [26,27]. Combined with Tikhonov regularization, this is also known as Kernel Ridge Regression (KRR). The KRR method is a regularized least squares method that is used for classification and regression tasks. It has the following objective function:

$$\min_{W} \; \|F_Y - W F_X\|_F^2 + \alpha \|W\|_F^2 \tag{1}$$

where $W \in \mathbb{R}^{d_Y \times d_X}$ is the matrix of regression weights, $\|\cdot\|_F$ is the Frobenius norm and $\alpha$ is a regularization term. The solution of the optimization problem from Equation (1) involves the Moore-Penrose pseudo-inverse [28] and has the following closed-form expression:

$$W = F_Y F_X^T \left(F_X F_X^T + \alpha I_{d_X}\right)^{-1} \tag{2}$$

where the superscript $T$ signifies the transpose of the matrix and $I_{d_X}$ represents the identity matrix of dimension $d_X$. For low-dimensional feature spaces ($d_X, d_Y \le n$), Equation (2) can be calculated explicitly. For high-dimensional data, as is the case for image data, an explicit computation of $W$ as presented in Equation (2) without prior dimensionality reduction is computationally expensive. Fortunately, the closed-form solution can be computed via inversion of the Gram matrix of $F_X$ instead of the covariance matrix, given the following relation [28]:

$$\left(P^{-1} + X^T R^{-1} X\right)^{-1} X^T R^{-1} = P X^T \left(X P X^T + R\right)^{-1} \tag{3}$$

We substitute $X = F_X$, $R = \alpha I_{d_X}$, and $P = I_n$, where $I_{d_X}$ and $I_n$ are the $d_X$- and $n$-dimensional identity matrices, respectively. Hence, Equation (2) can be rewritten as:

$$W = F_Y \left(F_X^T F_X + \alpha I_n\right)^{-1} F_X^T \tag{4}$$

Even further, Equation (4) can be augmented by applying the kernel trick. The inputs $x_i$ are implicitly mapped to $\phi(x_i)$ in a high-dimensional Hilbert space [29]:

$$\Phi = \left[\phi(x_1), \phi(x_2), \ldots, \phi(x_n)\right]$$

When predicting a target $y_{new}$ from a new observation $x_{new}$, explicit access to $\Phi$ is never actually needed:

$$y_{new} = F_Y \left(\Phi^T \Phi + \alpha I_n\right)^{-1} \Phi^T \phi(x_{new}) = F_Y \left(K + \alpha I_n\right)^{-1} \kappa(x_{new})$$

With $K = \Phi^T \Phi$ and $\kappa(x_{new}) = \Phi^T \phi(x_{new})$, the prediction can be described entirely in terms of inner products in the higher-dimensional space. Not only does this approach work on the original data sets without the need for dimensionality reduction, but it also opens up ways to introduce non-linear mappings into the regression by considering different types of kernels, such as Gaussian or polynomial kernels.
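As a rough illustration of the closed-form solutions discussed above, the following NumPy sketch computes the ridge solution once through the d_X × d_X matrix F_X F_X^T and once through the n × n Gram matrix F_X^T F_X, and then makes a kernelized prediction. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names, the Gaussian kernel and its width `gamma`, and the column-wise layout of the feature matrices (one observation per column) are choices made here for illustration only.

```python
import numpy as np

def krr_fit_primal(F_X, F_Y, alpha):
    """W = F_Y F_X^T (F_X F_X^T + alpha I_dX)^{-1}; solves a d_X x d_X system."""
    d_X = F_X.shape[0]
    A = F_X @ F_X.T + alpha * np.eye(d_X)          # symmetric, d_X x d_X
    # A^{-1} (F_X F_Y^T) transposed equals F_Y F_X^T A^{-1} because A is symmetric.
    return np.linalg.solve(A, F_X @ F_Y.T).T

def krr_fit_dual(F_X, F_Y, alpha):
    """W = F_Y (F_X^T F_X + alpha I_n)^{-1} F_X^T; solves an n x n system."""
    n = F_X.shape[1]
    G = F_X.T @ F_X + alpha * np.eye(n)            # regularized Gram matrix, n x n
    return F_Y @ np.linalg.solve(G, F_X.T)

def gaussian_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||a_i - b_j||^2) for columns a_i of A and b_j of B."""
    sq = (A ** 2).sum(axis=0)[:, None] + (B ** 2).sum(axis=0)[None, :] - 2.0 * (A.T @ B)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_predict_kernel(F_X, F_Y, X_new, alpha, gamma):
    """Kernelized prediction: F_Y (K + alpha I_n)^{-1} kappa(x_new), one column per new input."""
    K = gaussian_kernel(F_X, F_X, gamma)           # n x n
    k_new = gaussian_kernel(F_X, X_new, gamma)     # n x m
    n = K.shape[0]
    return F_Y @ np.linalg.solve(K + alpha * np.eye(n), k_new)

# Toy data (hypothetical sizes): d_X = 20 input features, d_Y = 5 output features, n = 8 samples.
rng = np.random.default_rng(0)
F_X_demo = rng.normal(size=(20, 8))
F_Y_demo = rng.normal(size=(5, 8))
W_primal = krr_fit_primal(F_X_demo, F_Y_demo, alpha=0.5)
W_dual = krr_fit_dual(F_X_demo, F_Y_demo, alpha=0.5)
print("primal and dual solutions agree:", np.allclose(W_primal, W_dual))
```

When d_X is much larger than n, as is typical for image features, the dual form is the cheaper of the two because the linear system it solves has only n rows.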
The schematic representation of the proposed framework for mapping images to text is shown in Figure 2. The image and text features are used as input and output features for the proposed KRR model.

Data Sets
We evaluated the proposed KRR model for mapping images to text on three data sets from different domains. The data sets contain images and text associated with each image. For the first data set, the text is a description or impression of the image, written by a user. For the second data set, the text associated with the images is a radiologist's report written in Spanish. For the third data set, the text associated with images is a set of descriptive words, written by a user. We describe these data sets in the following.
Text for Sentiment Analysis (T4SA). This data set was introduced in [30]. The data were collected from Twitter posts over 6 months and, using an LSTM-SVM (Long Short-Term Memory-Support Vector Machine) architecture, the tweets were divided into three sentiment categories: positive, neutral, and negative. For image labeling, the authors selected the data with the most confident textual sentiment predictions, and they used these predictions to automatically assign sentiment labels to the corresponding images. In our experimental evaluation we selected 10k images and the corresponding 10k tweets from each of the three sentiment categories. Figure 1A shows examples of images and the associated texts from this data set.
PadChest data set. The PadChest data set [31] is a public corpus which was collected in Spain at Hospital San Juan from 2009 to 2017. It includes more than 160k X-ray images. The X-ray images were interpreted by radiologists and each image was associated with a report written in Spanish. 27% of the reports were manually annotated by trained physicians and the remaining set was labeled using a supervised method based on a recurrent neural network with attention mechanisms. In our experimental evaluation we selected 10k images and the corresponding 10k reports. Figure 1B shows samples from this data set.
ESP Game data set. The ESP Game data set [3] is a public data set from Kaggle. The ESP Game is an online game that awarded players points if they could label an image with the same word as another unknown player logged in from a different location. The ESP Game data set consists of 100k images and each image has a list of words associated with it. The data set was labeled using the ESP Game. We selected 10k images and the corresponding 10k descriptions for the experimental evaluation. Figure 1C shows samples of the ESP Game data set.

Image Features
The research on feature extraction from images proceeds along two directions: (i) traditional, hand-crafted features, and (ii) automatically generated features. With the increasing number of images and videos on the web, traditional methods have a hard time handling the scalability and generalization problem. In contrast, techniques based on automatically generated features are capable of learning robust features from a large number of images [32].
To emphasize the advantage of deep-learning techniques for high-level image representation, we compared the performance of four CNN architectures used for image representation with a baseline that simply converts the images into arrays of pixel values. Each image was sliced to get the RGB data. The 3-channel RGB image format was preferred over the 1-channel image format since we wanted to use all the available information related to an image. Using this approach, each image was described by a 2352-dimensional (28 × 28 × 3) feature vector.
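The RGB baseline can be illustrated with a short sketch. It assumes that the images have already been resized to 28 × 28 pixels and loaded as H × W × 3 arrays; the function names are hypothetical, and no scaling of the pixel values is applied because the text does not describe one.

```python
import numpy as np

def rgb_pixel_features(image):
    """Flatten one H x W x 3 RGB array into a 1-D feature vector."""
    image = np.asarray(image)
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("expected an H x W x 3 RGB array")
    return image.reshape(-1).astype(float)

def rgb_feature_matrix(images):
    """Stack per-image vectors column-wise into a d_X x n feature matrix."""
    return np.stack([rgb_pixel_features(img) for img in images], axis=1)

# Synthetic stand-ins for four resized 28 x 28 RGB images (random values, not real data).
rng = np.random.default_rng(0)
demo_images = rng.integers(0, 256, size=(4, 28, 28, 3), dtype=np.uint8)
F_X_rgb = rgb_feature_matrix(demo_images)
print(F_X_rgb.shape)
```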
Deep-learning models use a cascade of layers to discover feature representations from data. Each layer of a convolutional network produces an activation for the given input. Earlier layers capture low-level features of the image like blobs, edges, and colors. These primitive features are abstracted by the high-level layers. Studies from the literature suggest that while using pre-trained networks for feature extraction, the features should be extracted from the layer right before the classification layer [17]. For this reason, we extracted the features from the last layer before the final classification, so the entire convolutional base was used for this.
For understanding the features of an input image and how the networks work, it is important to understand how the outputs of convolution and pooling layers are calculated. Convolutional parameters can be used for reducing some features in the image which can be ignored in the training process. The following hyperparameters are used for calculating the number of network parameters: number of filters ($k$), filter width ($F_w$), filter height ($F_h$), stride width ($S_w$), stride height ($S_h$) and padding ($P$). To determine the receptive field (the size of the region in the input that produces the feature) described by output width ($O_w$) and output height ($O_h$), the following equations are used:

$$O_w = \frac{I_w - F_w + 2P}{S_w} + 1, \qquad O_h = \frac{I_h - F_h + 2P}{S_h} + 1$$

where $I_w$ and $I_h$ denote the width and height of the input. The following formula is used to calculate the output of the pooling layer:

$$O = \frac{IM - F}{S} + 1$$

where $IM$ is the size of the input matrix, $F$ is the size of the filter and $S$ represents the stride. Starting from the input image and applying the above formulas, the convolutions, pooling and feature map outputs will be obtained. We investigated four different network architectures: VGG16, Inception V3, ResNet50 and Xception. The four architectures differ in the number of layers and the number of parameters. These network architectures were chosen because they are among the most popular CNN architectures. Each network brings different improvements over the first CNN architecture (AlexNet), which was developed in 2012 [33].
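The output-size relations for convolution and pooling layers can be checked with a few lines of code. The sketch below uses the standard formulas with integer (floor) division; the function names and the example layer settings are illustrative assumptions rather than the exact configurations used in the experiments. It also includes the usual parameter count of a convolutional layer, k · (F_w · F_h · C_in + 1), where C_in (the number of input channels) is a symbol introduced here.

```python
def conv_output_size(in_size, filter_size, stride, padding):
    """Output width (or height) of a convolution along one spatial dimension."""
    return (in_size - filter_size + 2 * padding) // stride + 1

def pool_output_size(in_size, filter_size, stride):
    """Output width (or height) of a pooling layer along one spatial dimension."""
    return (in_size - filter_size) // stride + 1

def conv_layer_params(num_filters, filter_w, filter_h, in_channels):
    """Trainable parameters of a convolutional layer, one bias per filter."""
    return num_filters * (filter_w * filter_h * in_channels + 1)

# A 3 x 3 convolution with stride 1 and padding 1 keeps a 32-pixel side unchanged,
# and a 2 x 2 pooling with stride 2 halves it.
side = conv_output_size(32, filter_size=3, stride=1, padding=1)
side = pool_output_size(side, filter_size=2, stride=2)
print(side, conv_layer_params(64, 3, 3, 3))
```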

VGG16
The VGG16 network architecture was introduced in 2014 [34]. VGG16 brings several improvements over AlexNet: fewer parameters, a larger number of weight layers and a more discriminative decision function, to name just a few. The large kernel-sized filters of the first and second convolutional layers of the AlexNet architecture were replaced with multiple 3 × 3 kernel-sized filters in the VGG16 architecture. VGG16 has a uniform architecture with 16 hidden layers and 138 million trainable parameters. For computational reasons, when the features were extracted using the VGG16 architecture, the images were resized to a 3072-pixel resolution. The VGG16 network was initialized with the ImageNet weights. Figure 3A shows the graphical representation of the VGG16 network.

Inception
The Inception network was introduced in 2014 (Inception V1 [35]) as the winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [8]. The architecture of this type of network consists of 22 hidden layers and a number of parameters reduced to 7 million (for comparison, AlexNet, the first CNN architecture, has 60 million parameters). Inception networks contain a module called the inception module, which approximates a sparse CNN with a normal dense construction. The Inception V3 network was introduced in [36] and it has 42 hidden layers and 23 million parameters. The architecture of Inception V3 consists of 11 inception modules. Each module is formed by pooling layers, convolutional filters and an activation function, which in this case is the rectified linear unit function. When the image features were extracted using the Inception V3 architecture, the images were resized to 75 × 75 and a feature map of size 3 × 3 × 2048 from the final convolutional layer was returned. Figure 3B shows the graphical representation of the Inception V3 network architecture.

Residual Networks
Residual Networks (ResNets) [37] are a type of classical neural network that was introduced in 2015 as the winning model of the ImageNet challenge (ILSVRC 2015) [8]. A network has a Residual Network architecture if, in addition to the convolution, pooling, activation and fully connected layers, it also has identity connections between the layers. A residual block can be represented mathematically as follows [38]:

$$y = F(x, \{W_i\}) + x$$

where $y$ is the output of the residual block and $x$ is its input. $W_i$ represents the weight layers contained in the residual block, where $1 \le i \le$ the number of layers in a residual block. If the residual block contains 2 weight layers, the residual function $F(x, \{W_i\})$ can be written as follows:

$$F(x, \{W_i\}) = W_2 \, \sigma(W_1 x)$$

where $\sigma$ is the ReLU activation function, calculated using the equation:

$$\sigma(x) = \max(0, x)$$

In comparison to VGG networks, the evaluation time is reduced when residual networks are used. ResNet50 is a convolutional network with a depth of 50 hidden layers and over 23 million trainable parameters. The network requires an input image of size 224 × 224 pixels with 3 input channels, but this size can be lower for computational reasons. In the experimental evaluation, we used an input image of size 32 × 32 × 3 and ResNet50 returned a 2048-dimensional feature vector. Figure 3C shows the graphical representation of the ResNet50 architecture.
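To make the identity connection concrete, here is a minimal NumPy sketch of a residual block with two weight layers. It is a simplification made for illustration: dense weight matrices stand in for convolutional layers, biases and batch normalization are omitted, and the function names are hypothetical.

```python
import numpy as np

def relu(z):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """y = W2 relu(W1 x) + x: a two-layer residual function plus the identity shortcut."""
    return W2 @ relu(W1 @ x) + x

x_demo = np.array([1.0, -2.0, 3.0])
identity = np.eye(3)
# With zero weights in the second layer the block reduces to the identity mapping,
# which is the property that eases the training of very deep networks.
print(residual_block(x_demo, identity, np.zeros((3, 3))))
print(residual_block(x_demo, identity, identity))
```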

Xception
The Xception network was introduced in 2016 [39]. The Xception architecture is an improved version of the Inception V3 architecture: the inception modules have been replaced with depthwise separable convolutions. The architecture of the Xception network consists of 36 convolutional layers that form the feature extraction base. The Xception network is structured into 14 modules. All modules except the first and the last have residual connections around them. The regular input size of images is 224 × 224, but for computational reasons we used the smallest possible size, 71 × 71. The output of the convolutional base has a size of 3 × 3 × 2048. Figure 3D shows the graphical representation of the Xception network.
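The effect of replacing regular convolutions with depthwise separable convolutions can be illustrated by counting weights. The sketch below compares a standard k × k convolution with a depthwise k × k convolution followed by a 1 × 1 pointwise convolution; biases are ignored, and the channel sizes in the example are arbitrary values chosen for illustration, not layer sizes taken from Xception.

```python
def standard_conv_weights(kernel, c_in, c_out):
    """Weights of a regular kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out

def separable_conv_weights(kernel, c_in, c_out):
    """Weights of a depthwise kernel x kernel convolution plus a 1 x 1 pointwise convolution."""
    depthwise = kernel * kernel * c_in   # one spatial filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise

standard = standard_conv_weights(3, 128, 256)
separable = separable_conv_weights(3, 128, 256)
print(standard, separable)
```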

Comparison of Network Architectures
For comparison purposes, we designed the graphical representations of the four deep-learning architectures, which are shown in Figure 3. The graphical representations help us to visualize the similarities and differences between the four architectures and to get insights related to them. While the VGG16 architecture is formed by convolutional, max pooling and fully connected layers, the other three architectures are built from modules and blocks of layers. Instead of stacking convolutional layers, the Inception V3, ResNet50 and Xception networks stack modules or blocks, within which are convolutional layers. It is clear from the graphics of Figure 3 that the architecture of the Inception V3 network was improved in comparison to the first two architectures: the types of the layers, the inception modules, the number of layers. The graphical representations indicate that the ResNet50 network uses batch normalization and the skip connection concept. One can notice that, starting from the VGG16 network, which was developed in 2014, network architectures became more complex and they bring different improvements over AlexNet, the first CNN architecture: number of layers, use of modules, number of pooling layers, activation and loss function, regularization and optimization. One can also notice that even if the number of layers increases, the number of parameters decreases. Table 1 presents the comparison between the network architectures described above, regarding the number of layers and the number of parameters.

Text Features
We used a Bag-of-Words (BoW) model [40] for extracting the features from the text samples. The first step in building the BoW model consists of pre-processing the text: removing non-letter characters, removing the HTML tags, converting words to lowercase, removing stop-words and splitting the text into words. The vocabulary is built from the words that appear in the text samples. The input of the BoW model is a list of strings and the output is a sparse matrix with the dimension: number of samples × number of words in the vocabulary, with 1 if a given word from the vocabulary is contained in that particular text sample and 0 otherwise. We initialized the BoW model with a maximum of 5000 features. We extracted the vocabulary for each data set, and the corresponding 0-1 feature vector for each text sample.
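The pre-processing and the 0-1 encoding described above can be sketched in pure Python. The following is one possible reading of those steps, not the authors' implementation: the stop-word list is a tiny hypothetical sample, the vocabulary is capped by keeping the most frequent words, and the letter filter only covers unaccented Latin letters, so Spanish text would need a Unicode-aware pattern instead.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "is", "to", "on"}

def preprocess(text):
    """Strip HTML tags and non-letters, lowercase, split, and drop stop-words."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)    # remove non-letter characters (ASCII only)
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_vocabulary(texts, max_features=5000):
    """Map each of the max_features most frequent words to a column index."""
    counts = Counter(w for t in texts for w in preprocess(t))
    kept = sorted(w for w, _ in counts.most_common(max_features))
    return {w: i for i, w in enumerate(kept)}

def binary_bow(texts, vocab):
    """Return one 0-1 row per text: 1 if the vocabulary word occurs in the text."""
    rows = []
    for t in texts:
        row = [0] * len(vocab)
        for w in set(preprocess(t)):
            if w in vocab:
                row[vocab[w]] = 1
        rows.append(row)
    return rows

demo_texts = ["<b>A dog</b> in the park!", "The dog runs"]
demo_vocab = build_vocabulary(demo_texts)
print(demo_vocab)
print(binary_bow(demo_texts, demo_vocab))
```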

Experimental Protocol
We designed an experimental protocol that would help us answer the following questions:

1. Could our proposed Kernel Ridge Regression model map images to natural language descriptors?
2. What is the difference between the four types of network architectures that we considered? Also, we are interested in whether the more complex deep-learning features give a better performance in comparison to the simple RGB pixel-value features. Which of the four deep-learning network architectures performs best? Can we draw some insights from this comparison?
To answer these questions, we designed the following experimental protocol. For each of the three data sets, we randomly split the data 5 times into training and test set, taking 70% from the data set for training and the rest for testing. For training the model, we considered different sizes of the training set: from 50 to 7000 observations with a step size of 50. For a correct evaluation, the models built on these different training sets were evaluated on the same test set. The error was averaged over the 5 random splits of the data into training and test set.
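The protocol above can be summarized as a small learning-curve loop. The sketch below is a hedged outline under stated assumptions: `fit_and_score` is a hypothetical callback that trains a model on the given training indices and returns its error on the given test indices, and nested training subsets (the first m indices of each shuffled training set) are an assumption made here, since the text does not state how the smaller training sets were drawn.

```python
import numpy as np

def learning_curve(n_samples, fit_and_score, n_splits=5, train_frac=0.7, step=50, seed=0):
    """Mean and standard deviation of the test error for growing training sets."""
    rng = np.random.default_rng(seed)
    n_train_max = int(round(train_frac * n_samples))
    sizes = list(range(step, n_train_max + 1, step))
    errors = np.zeros((n_splits, len(sizes)))
    for s in range(n_splits):
        perm = rng.permutation(n_samples)                  # one random split
        train_idx, test_idx = perm[:n_train_max], perm[n_train_max:]
        for j, m in enumerate(sizes):
            # Every training size is evaluated on the same held-out test set.
            errors[s, j] = fit_and_score(train_idx[:m], test_idx)
    return sizes, errors.mean(axis=0), errors.std(axis=0)

# Dummy scorer used only to exercise the loop: the "error" shrinks with the training size.
def dummy_score(train_idx, test_idx):
    return 1.0 / len(train_idx)

sizes, mean_err, std_err = learning_curve(10000, dummy_score)
print(len(sizes), sizes[0], sizes[-1])
```

With 10k samples and a 70% training share, this yields training sizes from 50 to 7000 in steps of 50.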

Evaluation Measure
To measure how well our models map images to text, we developed a specific evaluation measure. The reason was that each output of our model represents a very large vector of probabilities, with the dimension equal to the number of words in the dictionary (approximately 5000 components). Each component of the output vector represents the probability of the corresponding word from the vocabulary being a descriptor of that image. Given this particular form of the output, the evaluation measure was computed as follows:
1. we sorted in descending order the absolute values of the predicted output vector;
2. we created a new vector containing the first 50 words from the predicted output vector;
3. we computed the Euclidean distance between the predicted output vector values and the actual output vector.
The actual output vector is a sparse vector: a component in this vector is 1 if the corresponding word from the vocabulary is contained in that particular description of the image. The values computed in the third step described above were averaged over the entire test data set and the average value obtained was considered to be the error.
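The evaluation measure can be sketched as follows. This is only one possible interpretation of the three steps, stated as an assumption: the 50 components with the largest absolute predicted values are kept, all other components are set to zero, and the Euclidean distance to the binary target vector is then computed. The function names are hypothetical.

```python
import numpy as np

def top_k_error(y_pred, y_true, k=50):
    """Euclidean distance between the top-k truncated prediction and the 0-1 target."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    top = np.argsort(-np.abs(y_pred))[:k]      # indices of the k largest |values|
    truncated = np.zeros_like(y_pred)
    truncated[top] = y_pred[top]               # keep only the top-k components
    return float(np.linalg.norm(truncated - y_true))

def mean_test_error(Y_pred, Y_true, k=50):
    """Average the per-sample error over the test set (one row per sample)."""
    return float(np.mean([top_k_error(p, t, k) for p, t in zip(Y_pred, Y_true)]))

# Tiny example with a 4-word vocabulary and k = 2.
example_error = top_k_error([0.9, 0.1, -0.8, 0.0], [1, 0, 0, 0], k=2)
print(example_error)
```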

Quantitative Analysis
The first questions raised above can be answered by analyzing the experimental results shown in Figure 4. The plots show the learning curve (mean errors and standard deviations) for different sizes of the training set and different sentiment categories. Since the error decreases as the training size increases, the model is clearly learning, and thus our proposed model can map images to natural language descriptors. The plots from Figure 4 also show the comparison between the RGB pixel-value features and the VGG16 features for the three sentiment categories considered. Overall, the more complex deep-learning features give a better performance in comparison to the simple RGB pixel-value features.
Furthermore, we can also see from Figure 4 that the neutral sentiment category has different behavior in comparison with the positive and negative sentiment categories. In the case of neutral sentiment, the more complex VGG16 features have a better performance than the simpler RGB pixel-value features as the size of the data increases. For positive and negative sentiment categories, the simpler RGB pixel-value features lead to an error which varies a lot, while using the VGG16 features, the error is more stable.
The second question can be answered by analyzing the experimental results shown in Figure 5. The plots from Figure 5A show mean errors and standard deviations for different sizes of the training set from the T4SA data set. The ResNet50 features perform better than both the simpler RGB pixel-value features and the features extracted with the other three CNN architectures.
The plots from Figure 5B show the learning curve (mean errors and standard deviations) for different sizes of the training set for the ESP Game data set, using different deep-learning techniques for image feature extraction. The KRR model performs best when the ResNet50 network is used as image feature extractor, but in general the more complex deep-learning features give a better performance in comparison to the simple RGB pixel-value features. Figure 5C shows the comparison between RGB pixel-value features, VGG16 features, ResNet50 features, InceptionV3 features and Xception features for the PadChest data set. The proposed model performs better when image features are extracted using deep-learning techniques. Figure 5 shows that our model with ResNet50 as image feature extractor performs best for all three data sets. Overall, the more complex deep-learning features give a better performance in comparison to the simple RGB pixel-value features.
The plots from Figure 6 show the comparison of the learning curves for the three data sets using ResNet50 as image feature extractor. The errors for the ESP Game data set and the T4SA data set are close to each other. The error for the PadChest data set, whose text is written in Spanish, is higher than for the other two data sets, whose descriptions are written in English.
Figure 6. Comparison of the learning performance for the three data sets using the ResNet50 image features.

Qualitative Analysis
To answer the first question from Section 4, we analyzed in more detail the natural language descriptors returned by our proposed KRR model. Figure 7 shows the natural language descriptors returned by our model using the four types of image features that we considered. The mapped image from Figure 7 is from the ESP Game data set. We compared the description returned by our model with the original image description. When our model uses image features extracted with the Xception network, only three words out of 20 correspond to the original image description. When the model uses the VGG16, ResNet50 or InceptionV3 network for image feature extraction, five of the returned words match the original description. However, if we look at the returned words, we can identify many correct image descriptors. For example, using InceptionV3 for image feature extraction, the following words describe the image: "eyes", "face", "gray", "hands", "person". To generate the descriptors from Figure 7, we used a training set of 7000 observations. If we considered the size of the training set from 50 to 70 observations with a step size of 50, three words matched the original description using the ResNet50, InceptionV3 and Xception networks, and only two words matched when the VGG16 network was used as image feature extractor.
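The comparison between returned and original words can be illustrated with a short sketch. It assumes that the 20 descriptors are the vocabulary words with the highest predicted scores and that a match is an exact, case-insensitive word match; both the ranking rule and the function names are assumptions made for illustration.

```python
def top_descriptors(scores, vocabulary, k=20):
    """Return the k vocabulary words with the highest predicted scores."""
    ranked = sorted(zip(scores, vocabulary),
                    key=lambda pair: pair[0], reverse=True)
    return [word for _, word in ranked[:k]]


def matching_words(predicted_words, original_description):
    """Predicted words that also occur in the original description."""
    original = set(original_description.lower().split())
    return [word for word in predicted_words if word in original]


# Toy example with a five-word vocabulary (made-up scores).
vocabulary = ["eyes", "face", "gray", "car", "tree"]
scores = [0.9, 0.7, 0.1, 0.8, 0.05]
predicted = top_descriptors(scores, vocabulary, k=3)
matches = matching_words(predicted, "man face eyes person")
```

An exact-match count of this kind is strict: a returned word that describes the image correctly but is missing from the original description is not counted, which is the effect noted above.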
We initialized the BoW model with a maximum of 5000 features. The extracted vocabulary for the ESP Game data set has 5000 words. Figure 8 shows the natural language descriptors generated using our KRR model and the original description of an image from the PadChest data set. To generate the descriptive words, we considered different sizes of the training set, from 50 to 7000 observations with a step size of 50, and an extracted vocabulary containing 3623 words. The vocabulary is smaller because of the repetition of words in the physician reports, and this may affect the performance of the model in generating similar words.
The qualitative analysis revealed that our proposed KRR model performs better on the ESP Game data set, which contains objective descriptions, in comparison to the PadChest and T4SA data sets.
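The effect of the 5000-feature limit on the BoW vocabulary can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the whitespace tokenization, the alphabetical tie-breaking and the function names are assumptions, and in practice a library vectorizer would normally be used instead.

```python
from collections import Counter


def build_vocabulary(descriptions, max_features=5000):
    """Keep at most max_features words, preferring the most frequent ones."""
    counts = Counter(word
                     for text in descriptions
                     for word in text.lower().split())
    # Most frequent first; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return sorted(word for word, _ in ranked[:max_features])


def encode(description, vocabulary):
    """Binary BoW vector: 1 if the vocabulary word occurs in the description."""
    present = set(description.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


descriptions = ["man face eyes", "face gray hands", "face eyes"]
small_vocabulary = build_vocabulary(descriptions, max_features=3)
full_vocabulary = build_vocabulary(descriptions)
```

When a corpus has fewer distinct words than the limit, the vocabulary is simply smaller than the limit, which is consistent with the 3623 words reported for PadChest.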
The experiments show that the model performance is better for the data sets whose texts are in English. This can be seen in Figure 6, which shows the comparison of the learning performance for the three data sets using the ResNet50 network architecture as image feature extractor. For the PadChest data set, whose texts associated with images were written in Spanish, the model returned the largest error in comparison with the other two data sets, whose texts were written in English.

Conclusions
In this work, we investigated a method for mapping images to text in different real-world scenarios. The mapping from images to text was performed using a Kernel Ridge Regression model. Several deep-learning approaches were used for image descriptor calculation, including VGG16, Inception V3, ResNet50, and Xception. To confirm the potential of deep-learning techniques for mapping images to text, we considered two types of features: simple RGB pixel-value features and image features extracted with deep-learning approaches. The experimental evaluation showed that the features extracted using different CNN architectures perform better than the RGB pixel-value features. We found that there is a difference in performance for different data sets and different deep-learning architectures. In particular, the mapping performs better using ResNet50 as image feature extractor, which has the largest number of hidden layers compared to the other three networks considered for the experimental evaluation. The results show that the model error obtained using the ResNet50 architecture is approximately 0.30 lower than the errors obtained with the other neural network architectures considered.
The experimental evaluation was performed on three data sets from different domains, each data set containing both text and images. We compared the original text with the text generated by our proposed model. The results showed that the proposed method can predict text that is close to the original one. We investigated the difference between objective and subjective text descriptions of images. Our method generated words more similar to the original descriptions of images for the data set whose text consists of objective descriptors associated with images.
As future work, we plan to further extend our approach by investigating the multimodal machine translation process [41] and by integrating into our model textual captions of images obtained using a pre-trained network [34]. The textual captions could be used as a new type of feature and could be compared and integrated with the other image features considered.