Entropy
  • Article
  • Open Access

22 May 2023

Infrared Image Caption Based on Object-Oriented Attention

Institute of Unmanned System Research, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Pattern Recognition and Data Clustering in Information Theory

Abstract

With the ongoing development of image technology, the deployment of various intelligent applications on embedded devices has attracted increased attention in the industry. One such application is automatic image captioning for infrared images, which involves converting images into text. This practical task is widely used in night security, as well as for understanding night scenes and other scenarios. However, due to the differences in image features and the complexity of semantic information, generating captions for infrared images remains a challenging task. From the perspective of deployment and application, to improve the correlation between descriptions and objects, we introduced YOLOv6 and LSTM as the encoder-decoder structure and proposed an infrared image captioning method based on object-oriented attention. Firstly, to improve the domain adaptability of the detector, we optimized the pseudo-label learning process. Secondly, we proposed the object-oriented attention method to address the alignment problem between complex semantic information and embedded words. This method selects the most crucial features of the object region and guides the caption model to generate words that are more relevant to the object. Our method performs well on infrared images and can produce words explicitly associated with the object regions located by the detector. The robustness and effectiveness of the proposed methods were demonstrated through evaluation on various datasets against other state-of-the-art methods. Our approach achieved BLEU-4 scores of 31.6 and 41.2 on the KAIST and Infrared City and Town datasets, respectively. It provides a feasible solution for the deployment of embedded devices in industrial applications.

1. Introduction

Image information entropy can be applied to the field of image processing. In this context, an image is treated as a two-dimensional array of pixels, and entropy measures the complexity or the amount of information it contains, i.e., how much information is present in the image. For example, in a black-and-white image in which each pixel has only two possible values (black or white), the entropy is very low, because the information in the image is elementary.
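As a simple illustration (not part of the proposed method), the Shannon entropy of a grayscale image can be estimated from its normalized intensity histogram; the snippet below is a minimal sketch.

```python
import numpy as np

def image_entropy(gray: np.ndarray, levels: int = 256) -> float:
    """Shannon entropy (bits per pixel) of a grayscale image's intensity histogram."""
    hist, _ = np.histogram(gray, bins=levels, range=(0, levels))
    p = hist.astype(np.float64) / hist.sum()
    p = p[p > 0]                          # drop empty bins to avoid log(0)
    return float(-(p * np.log2(p)).sum())

# A pure black-and-white image uses only two intensity levels,
# so its entropy is at most 1 bit per pixel.
bw = np.random.choice([0, 255], size=(64, 64)).astype(np.uint8)
print(image_entropy(bw))                  # <= 1.0
```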
In recent years, with the development of computer technology, various intelligent applications are no longer limited to server-side deployment. The demand for deploying functions such as autonomous object detection, target tracking and positioning, and image captioning on embedded devices is increasing. However, embedded devices often have to contend with power supply and consumption constraints, making it difficult to deploy many large models directly. Therefore, it is necessary to find a method that is easy to deploy, requires fewer computing resources, and maintains good performance.
Image captioning involves describing the contents of an image in sentences, bridging the gap between image processing and natural language processing. It is a critical component of intelligent applications. The encoder–decoder structure holds a dominant position in the field [,]. In these methods, a cascade of convolutional layers forms the image encoder: the image is passed through a pre-trained convolutional neural network, and a one-dimensional feature vector is extracted at the network's fully connected layer. A recurrent neural network then acts as the decoder, fitting this feature vector to the embedded annotated sentences.
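A minimal sketch of this encoder–decoder pattern is shown below; the module names and dimensions are illustrative and do not reproduce the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """Encode an image into a single feature vector with a convolutional backbone."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        backbone = models.resnet50()          # load pre-trained weights in practice
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images):                # images: (B, 3, H, W)
        x = self.features(images).flatten(1)  # (B, 2048)
        return self.fc(x)                     # (B, embed_dim)

class LSTMDecoder(nn.Module):
    """Fit the image feature to embedded caption words with an LSTM."""
    def __init__(self, vocab_size: int, embed_dim: int = 512, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feat, captions):        # captions: (B, T) word indices
        tokens = self.embed(captions)         # (B, T, embed_dim)
        inputs = torch.cat([feat.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)               # (B, T+1, vocab_size) word scores
```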
In industrial applications, infrared sensors are preferred over visible light sensors, as they can capture images in all weather conditions and are resistant to meteorological interference. Infrared sensors measure a physical characteristic: any object with a temperature above absolute zero emits infrared radiation and can therefore be imaged. Unlike visible light sensors, they can be used in low-light conditions and at night, making them particularly useful in extreme weather conditions such as fog, rain, and snow []. They have broad application prospects for night tasks. Therefore, studying image captioning based on infrared images is of significant importance and has broad practical value for deployment on embedded devices. However, there are still many challenges to be addressed, which are discussed in detail below.
Lacking precision relevant to the established object. In our preliminary experiments, we found that pre-trained models tend to describe varied content in an image, including the relationships and states between different objects. While the results generated in this way are diverse, it is difficult to obtain descriptions that are highly relevant to the established object. For example, as can be seen from Figure 1, when multiple objects are present, the annotations often describe the image from multiple perspectives, where the result of (3) is typically considered the most complete and accurate description of the behavior of the main object in the image. However, in actual tasks, there are always some objects that we focus on and some that can be ignored, so different methods tend to generate varying results due to the uncertainty of global information. To address this, some researchers have utilized the semantic attention mechanism [] to align feature map regions and embedded words. The attention modules learn each word and the corresponding image region; the low-level semantic features help the model select the most essential and relevant image regions when generating a caption. However, while nouns have specific image areas that match them, directly mapping verbs, prepositions, and adjectives to image areas is inaccurate. This may result in the generation of redundant information during feedback processing. Moreover, without high-level information guidance, the model may prioritize visually salient image regions that are semantically unrelated to the main subject of the image. Therefore, when considering the requirements of practical tasks, we usually want the model to focus on certain predefined objects. It is crucial to incorporate high-level information to guide the model in generating descriptions relevant to the predefined objects.
Figure 1. Different caption results are generated from different methods. (1) Generated from the baseline model []. (2) Generated from the baseline model []. (3) Represents our proposed method. (4) The Ground Truth.
Lacking domain adaptability. The performance of a model depends heavily on the support of datasets. Using pre-trained models and fine-tuning their parameters through transfer learning is a widely adopted technique. However, this approach often encounters overfitting caused by sample imbalance. For instance, in object detection tasks, we frequently use certain object types as targets for transfer learning. The transfer results on the dataset usually exhibit good accuracy. However, during testing, we often observe high false positive rates, i.e., similar features are incorrectly identified as the designated targets. Image captioning faces similar problems. When the images and annotations in the dataset lack diversity, training tends to favor specific high-frequency vocabulary, resulting in very similar descriptions and overfitting. Introducing more diverse and comprehensive datasets can alleviate this problem. However, most open-source datasets focus on visible light images, and the same object has different features in visible light and infrared images. Figure 2 presents visible light and infrared images of the same scenes and targets, captured in daytime and at night. As the figure shows, the objects in the visible light images exhibit distinct textures and colors, while the infrared images mainly contain the contour features of the objects. The different feature distributions of visible light and infrared images lead to domain gaps, which seriously affect performance. On the other hand, most pre-trained models are trained on massive visible light datasets with many instance-level annotations []. When implementing these pre-trained models, it is important to follow the implicit rule that the testing and training images are distributed in the same feature space. For instance, if we set infrared images as the target domain and visible-light images as the source domain, pre-trained models cannot be applied directly. We could increase the amount of data in the target domain by re-labeling, but this is a very time-consuming and labor-intensive task. Therefore, in order to avoid overfitting caused by sample imbalance in practical applications, existing open-source datasets should be used to expand the feature space of the target domain and achieve knowledge transfer. It is essential to enhance the diversity and richness of the target domain data as much as possible.
Figure 2. Infrared images and visible light images. (a) Infrared image; (b) visible light image (night); (c) infrared image; (d) visible light image (daytime).
Based on the above discussion, we followed the encoder–decoder structure and proposed an infrared image caption method. We extracted image features and high-level semantic information based on YOLOv6 [], and treated them as the encoder. Then, we used LSTM to implement decoding. The primary contributions of this paper are as follows:
(1) Firstly, we enhanced the knowledge transfer process by leveraging the approach outlined in reference []. We refined the selection process for pseudo-labels, resulting in a more robust detector capable of adapting effectively to the change in image domain.
(2) Secondly, we introduced the object-oriented attention module, which combines high-level semantic information and image features to weigh the proportion of each component in the multi-level features during word generation. This approach guides the model to generate descriptions that are specifically related to the predefined object.
(3) We provided an infrared dataset of streets in both towns and cities and conducted extensive experiments across multiple datasets to validate the effectiveness of our proposed method. The results from our simulations demonstrate the efficacy of our approach.

3. Methods

In this section, we provide a detailed description of our approach. Firstly, we introduce the domain transfer method used to perform domain adaptation for high-level information extraction. Secondly, we introduce the object-oriented attention module, which enables the generation of words that are explicitly linked to objects detected in an image. By combining the global information provided by the image feature with local information obtained from image regions and object classes, we guide the model through the word generation process. An overview of our method is shown in Figure 3. First, we use the domain transfer method to transfer the detector: images of the approximate domain are obtained through the generative adversarial network, and the detector is fine-tuned on them. Then, we use the fine-tuned detector to obtain pseudo-labels in the target domain and finally fine-tune the detector on the pseudo-labels of the target domain together with the labels of the approximate domain, which makes the detector more adaptive to infrared images. Given an infrared image, we can obtain the image features, object features, and corresponding categories from the detector. These features are combined with word features to form the training data, and the model is then trained using the object-oriented attention method with the adaptive weighting module, making the model tend to generate statements related to objects.
Figure 3. The overview of our proposed methods.
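In outline, the training pipeline described above can be summarized as follows. This is a high-level sketch with hypothetical helper names (`finetune`, `best_pseudo_label`, `extract`, `train_step`) standing in for the components detailed in Sections 3.1 and 3.2.

```python
def train_infrared_captioner(detector, cyclegan_G, captioner, source_set, target_images):
    """Sketch of the training pipeline; all callables are placeholders.

    source_set:    labelled visible-light images (source domain D_S)
    target_images: unlabelled infrared images (target domain D_T)
    """
    # 1. Style-transfer the labelled visible-light images towards the infrared
    #    domain with the adversarial network, forming the approximate domain.
    transferred = [(cyclegan_G(img), boxes, cls) for img, boxes, cls in source_set]

    # 2. Fine-tune the detector on the transferred (approximate-domain) data.
    detector.finetune(transferred)

    # 3. Run the fine-tuned detector on real infrared images and keep the
    #    candidate with the highest domain similarity as a pseudo-label.
    pseudo_labels = [detector.best_pseudo_label(img) for img in target_images]

    # 4. Fine-tune again on approximate-domain labels plus pseudo-labels.
    detector.finetune(transferred + pseudo_labels)

    # 5. Train the caption model: image features, object regions, and classes from
    #    the detector are fused with word embeddings via object-oriented attention.
    for img in target_images:
        feats, regions, classes = detector.extract(img)
        captioner.train_step(feats, regions, classes)
```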

3.1. Domain Transfer Method

The main differences between visible light images and infrared images lie in their low-level features, such as edges, textures, and colors. Visible light images typically exhibit clear, well-defined objects, while infrared images tend to have lower contrast and mainly show contour features. In light of these differences, it is important to reduce the distance between the distributions of the two domains. Our approach is based on semi-supervised learning, as illustrated in Figure 4. We improved the pseudo-label learning process with a domain similarity loss, which helps select the optimal result to be regarded as the pseudo-label. The visible light images are considered the source domain $D_S$ and the infrared images the target domain $D_T$. We also implemented CycleGAN [] to alleviate the domain gap.
Figure 4. Domain transfer.
The adversarial network transfers images between the two domains $I$ and $J$:
$$G: I \rightarrow J$$
$$F: J \rightarrow I$$
where  G  and  F  are convolutional neural networks. An unsupervised adversarial network was created by combining the two mapping relationships, and trained until it reached equilibrium. The domain transfer between the source domain  D S  and target domain  D T  was achieved using this theoretical method. The main process is as shown below:
$$i \in \mathbb{R}^{H_1 \times W_1 \times 3}$$
$$D_S = \{(i, b, c)\}$$
$$D_T' = \{(i', b, c)\}$$
where $i$ represents the visible light image and $i'$ represents the visible light image transferred by the adversarial network. $H_1$ and $W_1$ represent the height and width of the visible light images, respectively. $c \in C$ refers to the categories and $b \in \mathbb{R}^4$ represents the bounding boxes of the source domain $D_S$.
After the aforementioned operations, each transferred visible light image $i'$ in the approximate target domain $D_T'$ carries an instance-level annotation. The detector was fine-tuned on $D_T' = \{(i', b, c)\}$ by means of transfer learning. Then, the target domain images in $D_T$ were input into the fine-tuned detector, producing coarse outputs with a list of confidence scores for each category. We selected the top five results with the highest confidence scores as candidate annotations. To determine the best candidate, we cropped the top five results based on their respective coordinates and utilized the detector's backbone to reduce the domain gap and identify the most suitable result.
$$D_T^{top5} = \{(conf_{top5}, b, c)\}$$
We utilized the backbone with an added fully connected layer to infer the transferred image patch $f_s$, extracting the corresponding vector $v_s$. We also inferred the candidate patches of the target domain $f_t^m$, $m = 1, 2, \ldots, 5$, and extracted the corresponding vectors $v_t^m$. The vectors $v_s$ and $v_t^m$ were then used to calculate the domain similarity loss $loss_{ds}$, with the formal expression as follows.
$$v_s = \Phi(f_s)$$
$$v_t^m = \Phi(f_t^m)$$
$$conf(v_s, v_t^m) = \frac{v_s \cdot v_t^m}{\left\| v_s \right\|_2 \left\| v_t^m \right\|_2}$$
$$loss_{ds} = \frac{1}{m} \sum_{i=1}^{m} \left( p_i - conf_i \right)^2$$
where $\Phi$ is the classifier, $conf(v_s, v_t^m)$ represents the cosine distance between $v_s$ and $v_t^m$, $conf = \{conf_1, conf_2, \ldots, conf_5\}$ represents the domain similarities, and $p = \{p_1, p_2, \ldots, p_m\}$ is the similarity label, which we set to 1 or 0 in our experiments. Finally, the result with the highest confidence was selected and regarded as a pseudo-label, which can be formulated as:
$$j \in \mathbb{R}^{H_2 \times W_2 \times 3}$$
$$D_T = \{(j, c)\}$$
$$D_T^{best} = \{(p_{best}, j, b, c)\}$$
where $j$ represents the infrared image, and $H_2$ and $W_2$ represent its height and width, respectively. $p_{best}$ denotes the result with the highest domain similarity and $D_T^{best}$ refers to the pseudo-labels of the target domain. In the final step, the detector was fine-tuned on $U$, which consists of $D_T'$ and $D_T^{best}$.
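To make the selection step concrete, a minimal PyTorch sketch of the domain similarity loss and the pseudo-label selection is given below. The callable `phi`, the patch tensors, and the `candidates` list are hypothetical stand-ins for the classifier $\Phi$ and the detector's top-5 outputs.

```python
import torch
import torch.nn.functional as F

def domain_similarity_loss(v_s, v_t, p):
    """Squared error between cosine similarities and similarity labels (loss_ds).

    v_s: (D,)   feature vector of the transferred source-domain patch
    v_t: (M, D) feature vectors of the M candidate target-domain patches
    p:   (M,)   similarity labels (1 or 0 in the paper's experiments)
    """
    conf = F.cosine_similarity(v_s.unsqueeze(0), v_t, dim=1)   # (M,)
    return ((p - conf) ** 2).mean()

@torch.no_grad()
def select_pseudo_label(phi, source_patch, candidate_patches, candidates):
    """Keep the top-5 candidate whose feature is most similar to the transferred
    source-domain patch and return it as the pseudo-label.

    phi:               backbone plus fully connected layer (the classifier)
    source_patch:      (1, 3, h, w) crop from the transferred image
    candidate_patches: (5, 3, h, w) crops of the top-5 detections
    candidates:        list of 5 (box, cls, score) tuples from the detector
    """
    v_s = phi(source_patch)                        # (1, D)
    v_t = phi(candidate_patches)                   # (5, D)
    conf = F.cosine_similarity(v_s, v_t, dim=1)    # (5,) domain similarity
    return candidates[int(conf.argmax())]
```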

3.2. Object-Oriented Attention Module

Following domain transfer, the detector can be employed on infrared images. This section outlines the process of generating image captions from object information and features. Object classes serve as low-level intuitive features, whereas visual features represent high-level semantic features of a deep model. To fuse these low-level and high-level features, we propose a multi-stage feature fusion module. Its key function is to determine the weight proportion of each part of the multi-level features for word generation.
Given the infrared image $V$, let $r_j = \{r_1, r_2, \ldots, r_n\}$ represent the object regions of the input image $V$ and $c_j = \{c_1, c_2, \ldots, c_n\}$ represent the classes of the corresponding object regions. We first reshaped the feature map by flattening its width and height. This can be formulated as:
$$v = f(V)$$
$$v = Flatten(v)$$
$$r_j = Flatten(r_j)$$
where $r_j$ represents the object-oriented semantic features, $f$ denotes the backbone, and $v$ is the encoded image feature extracted by the backbone.
Then, we concatenated the image feature $v$ and the previous hidden state $h_{t-1}^c$ of the LSTM to form the input of the attention-LSTM and update its hidden state.
$$x_t^a = concat(h_{t-1}^c, v)$$
$$h_t^a = attLSTM(x_t^a, h_{t-1}^a)$$
where $x_t^a$ is the concatenated feature, $attLSTM$ represents the attention-LSTM, and $h_t^a$ represents the updated hidden state.
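For illustration, this state update can be written with `torch.nn.LSTMCell` as follows; the dimensions roughly follow the experimental settings reported later in Section 4.1 and are otherwise illustrative.

```python
import torch
import torch.nn as nn

feat_dim, hidden_dim, batch = 2048, 1024, 8
att_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)

v = torch.randn(batch, feat_dim)           # flattened image feature
h_c_prev = torch.randn(batch, hidden_dim)  # previous language-LSTM hidden state h_{t-1}^c
h_a_prev = torch.randn(batch, hidden_dim)  # previous attention-LSTM hidden state h_{t-1}^a
c_a_prev = torch.randn(batch, hidden_dim)  # previous attention-LSTM cell state

x_a = torch.cat([h_c_prev, v], dim=1)             # x_t^a = concat(h_{t-1}^c, v)
h_a, c_a = att_lstm(x_a, (h_a_prev, c_a_prev))    # updated state h_t^a
```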
The main function of the adaptive weighting module is to weigh the object region features and corresponding classes; it is the crucial part of object-oriented attention, as shown in Figure 5.
Figure 5. The adaptive weighting module.
In the first stage, we concatenated $r_j = \{r_1, r_2, \ldots, r_n\}$ and $c_j = \{c_1, c_2, \ldots, c_n\}$; the activated feature vector $\alpha_1$ was obtained via a fully connected layer and a hyperbolic tangent layer. We then computed the normalized weight $\hat{\alpha}_1$ using SoftMax. The weighted features $r_j^1$ and $c_j^1$ were obtained by multiplication with the normalized weight $\hat{\alpha}_1$. The processing can be formulated as follows:
$$v^1 = concat(r_j, c_j)$$
$$\alpha_1 = \tanh\left(\left[W_{v_1} v^1, W_{h_1} h_t^a\right]\right)$$
$$\hat{\alpha}_1 = softmax(\alpha_1)$$
$$r_j^1 = \sum_{j=1}^{n} \hat{\alpha}_1 r_j, \qquad c_j^1 = \sum_{j=1}^{n} \hat{\alpha}_1 c_j$$
In the second stage, the weighted features $r_j^1$ and $c_j^1$ obtained in the first stage were concatenated, and the same structure of fully connected layer, hyperbolic tangent layer, and SoftMax was applied. The whole process mirrors the first stage and can be formulated as follows:
$$v^2 = concat(r_j^1, c_j^1)$$
$$\alpha_2 = \tanh\left(\left[W_{v_2} v^2, W_{h_2} h_t^a\right]\right)$$
$$\hat{\alpha}_2 = softmax(\alpha_2)$$
$$r_j^2 = \sum_{j=1}^{n} \hat{\alpha}_2 r_j^1, \qquad c_j^2 = \sum_{j=1}^{n} \hat{\alpha}_2 c_j^1$$
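A PyTorch sketch of one weighting stage is given below; the second stage repeats the same structure on the weighted outputs. Layer names and dimensions are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveWeightingStage(nn.Module):
    """Scores each object region from its region feature, class feature, and the
    attention-LSTM state, then returns attention-weighted region and class features."""
    def __init__(self, region_dim=512, class_dim=512, hidden_dim=1024, att_dim=512):
        super().__init__()
        self.w_v = nn.Linear(region_dim + class_dim, att_dim)
        self.w_h = nn.Linear(hidden_dim, att_dim)
        self.w_a = nn.Linear(att_dim, 1)

    def forward(self, r, c, h_a):
        # r: (B, n, region_dim), c: (B, n, class_dim), h_a: (B, hidden_dim)
        v1 = torch.cat([r, c], dim=-1)                           # concat(r_j, c_j)
        scores = torch.tanh(self.w_v(v1) + self.w_h(h_a).unsqueeze(1))
        alpha = F.softmax(self.w_a(scores), dim=1)               # (B, n, 1) weights
        return (alpha * r).sum(dim=1), (alpha * c).sum(dim=1)    # weighted r, c

stage1 = AdaptiveWeightingStage()
r = torch.randn(8, 6, 512)      # 6 object regions per image
c = torch.randn(8, 6, 512)      # corresponding class features
h_a = torch.randn(8, 1024)      # attention-LSTM hidden state
r1, c1 = stage1(r, c, h_a)      # first-stage weighted features
```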
Lastly, we concatenated the refined features of object regions  r j 2  and classes  c j 2 , and input them into LSTM to align with embedded words, thus achieving decoding. The processing is shown in Figure 6, and the calculation formula is as follows:
$$x_t^c = concat(h_t^a, r_j^2, c_j^2)$$
$$h_t^c = LSTM(h_{t-1}^a, x_t^c)$$
$$p(y_t \mid y_{t-1}) = softmax(W_p h_t^c + b_p)$$
where $p(y_t \mid y_{t-1})$ represents the conditional probability of each word being generated at time step $t$.
Figure 6. Overview of object-oriented attention.
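A sketch of this final decoding step is given below, again with `torch.nn.LSTMCell` and with dimensions roughly matching Section 4.1 (512-dimensional refined features, a 1024-dimensional hidden state, and an 8532-word vocabulary); it is illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, feat_dim, vocab_size, batch = 1024, 512, 8532, 8
lang_lstm = nn.LSTMCell(hidden_dim + 2 * feat_dim, hidden_dim)
W_p = nn.Linear(hidden_dim, vocab_size)

h_a = torch.randn(batch, hidden_dim)        # attention-LSTM hidden state h_t^a
r2 = torch.randn(batch, feat_dim)           # refined region feature r_j^2
c2 = torch.randn(batch, feat_dim)           # refined class feature c_j^2
h_prev = torch.randn(batch, hidden_dim)     # previous hidden state of the language LSTM
cell_prev = torch.randn(batch, hidden_dim)  # previous cell state of the language LSTM

x_c = torch.cat([h_a, r2, c2], dim=1)              # x_t^c
h_c, cell = lang_lstm(x_c, (h_prev, cell_prev))    # h_t^c
p_word = F.softmax(W_p(h_c), dim=1)                # p(y_t | y_{t-1}) over the vocabulary
```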

4. Experiment

In this section, we provide a detailed description of the experimental methodology used in our study. We evaluated our proposed method on three different datasets: Pascal VOC2012 [], KAIST [], and Infrared City and Town. Following the conventional image caption annotation method, we annotated the KAIST [] and Infrared City and Town. The datasets are described below:
Pascal VOC2012 []: The dataset consists of 20 object categories, and the training and validation sets contain a total of 11,530 images with 27,450 ROI-annotated objects. This dataset has been widely used for object recognition and detection tasks.
KAIST []: The dataset is a multispectral pedestrian dataset that contains visible-light and infrared images. The dataset has a total of 103,128 dense annotations and 1182 distinct objects. We annotated the infrared images in this dataset, and for each image, we provided five sentences of manual description.
Infrared City and Town: This dataset was built by us, and contains three main object categories: airplane, car, and pedestrian. We captured the images using infrared equipment in the streets of both cities and towns, including various lighting and weather conditions such as sunny, cloudy, and rainy. We also annotated this dataset using five sentences per image.
Our experiments were divided into two parts:
(1) We validated the effectiveness of the domain transfer method on the basis of two sets of experiments: Pascal VOC2012 [] to KAIST [] and Pascal VOC2012 [] to Infrared City and Town.
(2) Through combination with the detector, we validated the effectiveness of the infrared image caption method on KAIST [] and Infrared City and Town, respectively.

4.1. Experimental Details and Metrics

All models were implemented using Python 3.6 and PyTorch 1.9 and trained using an NVIDIA 2080Ti GPU.
For domain transfer evaluation, we used a pre-trained detector. The detector was optimized with SGD and included L2 regularization. The learning rate was set to 0.001. In addition, data augmentation techniques such as random crop, random flip, and random brightness adjustments were used. We evaluated the detector using Precision (Pr), Recall (Re), F1 score (F1), and mean Average Precision (mAP), which are widely used in detection tasks. They can be calculated as follows:
$$Pr = \frac{TP}{TP + FP}$$
$$Re = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times Pr \times Re}{Pr + Re}$$
where TP, FP and FN represent the true positive, false positive and false negative, respectively.
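For reference, these three metrics can be computed directly from the counts, as in the short sketch below.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from true positive, false positive, and false negative counts."""
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1

# e.g. 90 true positives, 10 false positives, 30 false negatives
print(detection_metrics(90, 10, 30))   # (0.9, 0.75, 0.818...)
```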
For the infrared image caption experiments, we converted all annotations to lower case and removed function words and non-numeric, non-alphabetic characters that did not provide information in the descriptions. We then counted the occurrences of the remaining words and used those that appeared more than three times to build a dictionary, in which each word was converted to a one-hot vector. A total of 8532 words were used to train the image caption model. The learning rate was set to 0.001 and the batch size to 8. The input layer had a dimension of 2048 and the output layer a dimension of 1024. The class features, region features, and word embedding vectors were set to 512 dimensions. The maximum length of a generated sentence was set to 15. We used publicly available metrics, including BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, and CIDEr. The BLEU-n score is widely used for machine translation and can be calculated as follows:
$$BP = \begin{cases} 1, & c > r \\ e^{1 - \frac{r}{c}}, & c \le r \end{cases}$$
$$BLEU = BP \times \exp\left( \sum_{n=1}^{N} \omega_n \log p_n \right)$$
where $r$ refers to the length of the reference annotation and $c$ refers to the length of the candidate sentence; $\omega_n$ and $p_n$ are the weights and precisions of the n-grams, with $n = 1, 2, 3, 4$ and $\omega_n = \frac{1}{N}$.
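A simplified sketch of the BLEU computation with the brevity penalty is shown below (single reference, equal n-gram weights, and no smoothing), intended only to illustrate the formula.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU for a single candidate/reference pair of token lists."""
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))      # brevity penalty
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(count, ref[g]) for g, count in cand.items())
        p_n = clipped / max(sum(cand.values()), 1)           # clipped n-gram precision
        if p_n == 0:
            return 0.0                                       # no smoothing in this sketch
        log_sum += math.log(p_n) / max_n                     # omega_n = 1 / N
    return bp * math.exp(log_sum)

print(bleu("a man rides a bicycle at night".split(),
           "a man is riding a bicycle at night".split()))
```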
METEOR is based on word-to-word matching scores, and the calculation is as follows:
$$P = \frac{m}{c}$$
$$R = \frac{m}{r}$$
$$F_{mean} = \frac{10 \times P \times R}{9 \times P + R}$$
$$METEOR = F_{mean} \times \left( 1 - 0.5 \times \frac{ch}{m} \right)$$
where $P$ and $R$ are the caption precision and recall, respectively, $m$ represents the number of matched words, and $ch$ refers to chunks, i.e., series of contiguous and identically ordered matches between the generated caption and the annotation.
CIDEr evaluates the consensus between a candidate sentence and annotation and calculates the frequency of n-grams in a candidate sentence based on term frequency–inverse document frequency (TF-IDF). The weighting processing can be formulated as follows:
$$CIDEr_n(c_i, r_i) = \frac{1}{m} \sum_{j} \frac{g^n(c_i) \cdot g^n(r_{ij})}{\left\| g^n(c_i) \right\| \left\| g^n(r_{ij}) \right\|}$$
$$CIDEr(c_i, r_i) = \sum_{n=1}^{N} \omega_n CIDEr_n(c_i, r_i)$$
where $r_i$ refers to the reference annotations, $c_i$ refers to the candidate sentence, and $g^n$ is the vector consisting of all n-grams of length $n$.
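For illustration, one $CIDEr_n$ term can be sketched as the average cosine similarity between the candidate's TF-IDF n-gram vector and each reference's vector; the dictionaries below are toy stand-ins for precomputed TF-IDF vectors.

```python
import math

def cider_n(g_c: dict, g_refs: list) -> float:
    """Average cosine similarity between a candidate's TF-IDF n-gram vector
    and the vectors of its reference annotations (one CIDEr_n term)."""
    def cosine(a, b):
        dot = sum(a[k] * b.get(k, 0.0) for k in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    return sum(cosine(g_c, g_r) for g_r in g_refs) / len(g_refs)

# Toy TF-IDF vectors over unigrams.
g_c = {"car": 0.8, "street": 0.5}
g_refs = [{"car": 0.8, "road": 0.4}, {"pedestrian": 0.6, "street": 0.5}]
print(cider_n(g_c, g_refs))
```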

4.2. Quantitative Results of the Domain Transfer Method

In our approach, we use detectors to extract image features and high-level semantic information. Therefore, in this part of the experiments, we began by validating the effectiveness of the domain transfer method. We selected widely used models such as FasterRCNN [], SSD [], YOLOv4 [], and YOLOv6 [] as comparison methods and chose the common categories "car" and "pedestrian" shared by the datasets as the transfer targets. The backbone for FasterRCNN [] and SSD [] is Resnet50, and YOLOv4 [] and YOLOv6 [] both use the m-size variant.
First, we conducted the Pascal VOC2012 [] to KAIST [] experiment, where we used Pascal VOC2012 [] as the source domain and KAIST [] as the target domain to observe the performance changes. Table 1 shows that YOLOv6m+DT (domain transfer) achieved precision, recall, F1, and mAP of 58.47%, 60.33%, 59.39%, and 62.47%, respectively. Compared to the method without domain transfer, precision, recall, F1, and mAP increased by 35.33%, 37.76%, 36.54%, and 38.90%, respectively. Similar improvements were observed for the other detectors; for example, YOLOv4m+DT increased mAP by 35.47% over YOLOv4m. The two-step detector also showed improvement, with FasterRCNN_Resnet50+DT achieving an mAP about 32.85% higher than FasterRCNN_Resnet50.
Table 1. Detection results of Pascal VOC2012 to KAIST/%.
In another set of experiments, we transferred from Pascal VOC2012 [] to the Infrared City and Town dataset. We used Pascal VOC2012 [] as the source domain and Infrared City and Town as the target domain, and we observed similar patterns. As shown in Table 2, we obtained precision, recall, F1, and mAP of 81.64%, 80.93%, 81.28%, and 83.16%, respectively. Compared to the method without domain transfer, precision, recall, F1, and mAP increased by 50.82%, 49.76%, 50.29%, and 48.19%, respectively. The SSD_Resnet50 had poorer performance, but still achieved an mAP improvement of 47.11%. The detectors performed better on Infrared City and Town than on KAIST []. After analyzing the results, we found that many objects in Infrared City and Town appear against a relatively clean background compared to KAIST [], making it easier to distinguish the foreground from the background.
Table 2. Detection results of Pascal VOC2012 to Infrared City and Town/%.
In both experiments, both single-step and two-step detectors improved when combined with the domain transfer method. After domain transfer through transfer learning, the feature distribution of the detectors switched from visible light to infrared, making the detectors more adaptable to infrared object features. Based on the above discussion, the experimental results demonstrate the effectiveness of our domain transfer method. We achieved domain adaptation through the adversarial model and fine-tuned the detectors through pseudo-label learning. Ultimately, the detector trained on visible-light datasets demonstrated good domain adaptability on infrared datasets.

4.3. Quantitative Evaluation Results for Infrared Image Caption on KAIST

In this section of the experiment, we compare our method with several existing image caption models. We divided them into four types: (1) Neural network-based methods, such as Vgg16+RNN, Vgg16+LSTM, Neural Baby Talk [], Google NIC [], and Noc []; (2) Transformer-based methods, such as M2 Transformer [], Unified VLP [] and RSTNet []; (3) Attention-based methods, such as soft attention [], semantic attention [], Yu et al. [], OGA [], C-LSTM [], and our proposed method; and (4) Multimodal-based methods, such as mPLUG [] and OFA []. The structural composition of the model in each of the methods described above is presented in Table 3. All methods were set up with consistent basic settings. All experiments were conducted offline using two NVIDIA 2080ti GPUs. The learning rate for the encoder was set to 0.001, and the learning rate for the decoder was set to 0.004. Both the encoder and decoder were optimized using the Adam method, and training was conducted for 150 epochs. The same training set was used for all experiments, and no data augmentation was performed in this part. The maximum length for generating words was set to 15.
Table 3. Structural composition of the models.
Table 4 shows the corresponding infrared image captioning performances on KAIST []. Our proposed method obtained BLEU-4, METEOR and CIDEr scores of 32.6%, 26.8%, and 111.2%, respectively, the highest scores among all compared methods for all metrics. Our method outperformed neural network-based methods such as Google NIC [] and Noc [] by 6.5% and 4.7% in BLEU-4. Moreover, compared to other attention-based methods such as soft attention [], semantic attention [], Yu et al. [], and OGA [], our object-oriented attention method performed better, fully utilizing local object regions and high-level information and demonstrating improvements of more than 3.0%, 2.1%, 6.1%, and 6.7% in METEOR, respectively. While our method achieved significant advantages over the neural network- and attention-based methods, it achieved relatively similar results to the transformer-based and multimodal-based methods. For instance, compared to RSTNet [] (ResNext152), our method achieved improvements of 3.7% and 2.9% for BLEU-4 and METEOR, respectively. In addition, our method outperformed M2 Transformer [] and Unified VLP [] by 4.0% and 3.4% in BLEU-4, respectively, and scored 1.2% and 2.1% higher in METEOR than mPLUG [] and OFA []. Although the performance of the transformer-based and multimodal-based methods was very close to ours, our method has a simpler structure, requires less computational resources, has faster inference speed, and is more compatible with deployment on embedded platforms. Our proposed method also achieved a higher CIDEr score than the others, illustrating the similarity of the generated sentences to the ground truth. This suggests that our method can clearly state the objects in an image. These observations indicate that image captioning based on the semantics of object regions improves model performance significantly compared to describing global semantics. Our proposed method eliminates redundant information from other image areas and focuses more on the object.
Table 4. Performance of the proposed model on the KAIST compared with other models/%.

4.4. Quantitative Evaluation Results for Infrared Image Caption on Infrared City and Town

Similar to the previous section, we continued to use the methods mentioned earlier for testing on the self-built dataset. Table 5 shows the comparison results on Infrared City and Town, where every evaluation metric improved. Our proposed method obtained BLEU-4, METEOR and CIDEr scores of 42.2%, 36.9%, and 127.3%, respectively. Compared to C-LSTM [], BLEU-1, BLEU-2, BLEU-3 and BLEU-4 increased by 2.4%, 3.6%, 2.8% and 4.6%, respectively. Additionally, compared to Google NIC [], our method outperformed it by 7.4%, 9.8%, 6.4% and 7.0% in BLEU-1, BLEU-2, BLEU-3, and BLEU-4, respectively. Compared to the classical Neural Talk method [], our method improved METEOR and CIDEr by almost 11.3% and 16.0%, respectively. The proposed object-oriented attention method also obtained a better performance than other attention methods, such as soft attention [] and semantic attention []: by more than 7.3% and 5.1% for BLEU-4 and more than 7.8% and 6.2% for METEOR. As in the previous section, the results achieved by the transformer-based and multimodal-based methods were very close to those achieved by our method. For example, our method outperformed mPLUG [] and Unified VLP [] by 1.6% and 2.6% for BLEU-4, and 1.3% and 2.2% for METEOR, respectively. In addition, compared to RSTNet [] (ResNext101) and RSTNet [] (ResNext152), METEOR improved by 3.0% and 2.3%, and BLEU-4 improved by 3.6% and 2.8%, respectively. Our method also maintained a good performance in terms of CIDEr. The adaptive weighting module can fully exploit the potential relevance of the class feature, region feature, and embedded word vector, and can improve captioning performance by enabling interaction between the components of the visual region features. The introduction of this module breaks down the isolation between object regions and high-level information in the image, revealing the semantic relevance of each image region by comprehensively considering location and content relevance. Regions with high semantic relevance can assist each other in generating words, which effectively improves the model's ability to understand the image content. Based on the above analysis, the object-oriented attention method can guide the allocation of weights and provide more useful information for the text generation model. It is easy to conclude that the image captioning method based on object-oriented attention enhances the model's ability to understand regional relationships and improves description generation performance.
Table 5. Performance of the proposed model on the Infrared City and Town, compared with other models/%.

4.5. Qualitative Evaluation Results and Embedded Platform Porting

Figure 7 displays examples of image caption results with their corresponding annotations. We selected one method from each of the four types as objects of comparative testing: Google NIC [], semantic attention [], M2 Transformer [] and mPLUG []. Our method generates descriptions that are relevant to the objects in the images. The annotations attempt to describe the image's complex semantic content. For instance, in the second infrared image from left to right, the annotations include "pedestrians", "cars", "trees", and "buildings". The sentences describe these objects separately, but the crucial aspects of the foreground are the "pedestrians" and "cars"; the other elements can be considered background. In our opinion, the description of an image's content should prioritize the foreground. Our method can describe both objects based on the domain-transferred detector and object-oriented attention.
Figure 7. Infrared image captions results. (a) Urban scenery; (b) urban scenery; (c) rural scenery; (d) mountainous scenery.
Moreover, for the last image from left to right, the annotations contain different topic sentences, such as “mountain”, “tunnel”, and “airplane”. Our method focuses on the object itself, accurately generating the most relevant description of the “airplane”. These examples demonstrate that our method can accurately describe the complex semantic information of an infrared image, and we have achieved similar performances on both examination datasets. Based on the above discussion, the qualitative results further support the effectiveness of the proposed method.

5. Conclusions

In this paper, a method is proposed for generating captions for infrared images based on object-oriented attention. Our approach involves two models: a detector and an LSTM. We first fine-tune the detector on visible-light images that have undergone style transfer. Then, we utilize the fine-tuned detector to acquire pseudo-labels on the target domain with image-level annotation. Lastly, we fine-tune the detector based on the pseudo-labels and visible-light images that have undergone style transfer to obtain the final detector. Notably, to address the feature differences between visible light and infrared images, we propose the domain similarity loss, which optimizes the selection process of the pseudo-label, expands the range of the target domain distribution, and improves the adaptability of the detector. The transferred detector enables the LSTM to select the most relevant regions in the foreground and eliminate redundant semantics, resulting in more accurate and robust descriptions. We also introduce an object-oriented attention module for the LSTM that uses object classes and regions as guiding information to align corresponding embedded words. The resulting descriptions are more accurate and robust due to the high-level information guidance. We conduct comprehensive experiments on two infrared datasets, and the results demonstrate the effectiveness of our approach. Furthermore, our approach is suitable for implementation on embedded devices, as it requires fewer resources and is convenient to deploy.

Author Contributions

Conceptualization, J.L. and T.H.; Data curation, Y.X.; Investigation, T.H.; Methodology, J.L.; Resources, Y.X.; Software, J.L. and T.H.; Supervision, Y.Z.; Validation, Y.Z. and Y.X.; Visualization, J.L.; Writing—original draft, J.L.; Writing—review and editing, J.L. and T.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The Pascal VOC2012 and KAIST datasets are openly available in public repositories. They can be downloaded at http://host.robots.ox.ac.uk/pascal/VOC/voc2012/ (accessed on 10 May 2022) and https://github.com/SoonminHwang/rgbt-ped-detection/blob/master/data/README.md (accessed on 1 March 2022). The Infrared City and Town dataset is available on request from the authors. The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 652–663. [Google Scholar] [CrossRef] [PubMed]
  2. Lu, J.; Yang, J.; Batra, D.; Parikh, D. Neural Baby Talk. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7219–7228. [Google Scholar] [CrossRef]
  3. Wang, G.; Tao, B.; Kong, X.; Peng, Z. Infrared Small Target Detection Using Nonoverlapping Patch Spatial–Temporal Tensor Factorization with Capped Nuclear Norm Regularization. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  4. Li, B.; Zhou, Y.; Ren, H. Image Emotion Caption Based on Visual Attention Mechanisms. In Proceedings of the 2020 IEEE 6th International Conference on Computer and Communications (ICCC), Chengdu, China, 11–14 December 2020; pp. 1456–1460. [Google Scholar] [CrossRef]
  5. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes Challenge 2012 Results. Available online: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/ (accessed on 10 May 2022).
  6. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  7. Hsu, H.-K.; Hung, W.-C.; Tseng, H.-Y.; Yao, C.-H.; Tsai, Y.-H.; Singh, M.; Yang, M.-H. Progressive domain adaptation for object detection. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 738–746. [Google Scholar]
  8. Venugopalan, S.; Hendricks, L.A.; Rohrbach, M.; Mooney, R.; Darrell, T.; Saenko, K. Captioning Images with Diverse Objects. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1170–1178. [Google Scholar] [CrossRef]
  9. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv 2015, arXiv:1502.03044. [Google Scholar]
  10. Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6298–6306. [Google Scholar]
  11. Yu, N.; Hu, X.; Song, B.; Yang, J.; Zhang, J. Topic-Oriented Image Captioning Based on Order-Embedding. IEEE Trans. Image Process. 2019, 28, 2743–2754. [Google Scholar] [CrossRef] [PubMed]
  12. Zhu, Z.; Xue, Z.; Yuan, Z. Topic-Guided Attention for Image Captioning. In Proceedings of the 25th IEEE International Conference on Image Processing, Athens, Greece, 7–10 October 2018; pp. 2615–2619. [Google Scholar]
  13. Chen, F.; Xie, S.; Li, X.; Li, S.; Tang, J.; Wang, T. What Topics Do Images Say: A Neural Image Captioning Model with Topic Representation. In Proceedings of the 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shanghai, China, 8–12 July 2019; pp. 447–452. [Google Scholar]
  14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  15. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. ECCV 2020; Springer: Cham, Switzerland, 2020. [Google Scholar] [CrossRef]
  16. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  17. Wang, J.; Chen, Z.; Ma, A.; Zhong, Y. Capformer: Pure Transformer for Remote Sensing Image Caption. In Proceedings of the 2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 7996–7999. [Google Scholar] [CrossRef]
  18. Liu, C.; Zhao, R.; Chen, H.; Zou, Z.; Shi, Z. Remote Sensing Image Change Captioning with Dual-Branch Transformers: A New Method and a Large Scale Dataset. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–20. [Google Scholar] [CrossRef]
  19. Liu, C.; Zhao, R.; Shi, Z. Remote-Sensing Image Captioning Based on Multilayer Aggregated Transformer. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  20. Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-Memory Transformer for Image Captioning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 10575–10584. [Google Scholar] [CrossRef]
  21. Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J.; Gao, J. Unified Vision-Language Pre-Training for Image Captioning and VQA. arXiv 2019, arXiv:1909.11059. [Google Scholar] [CrossRef]
  22. Li, C.; Xu, H.; Tian, J.; Wang, W.; Yan, M.; Bi, B.; Ye, J.; Chen, H.; Xu, G.; Cao, Z.; et al. mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. arXiv 2022, arXiv:2205.12005. [Google Scholar]
  23. Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z.; Ma, J.; Zhou, C.; Zhou, J.; Yang, H. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. arXiv 2022, arXiv:2202.03052. [Google Scholar]
  24. Zhang, X.; Sun, X.; Luo, Y.; Ji, J.; Zhou, Y.; Wu, Y.; Huang, F.; Ji, R. RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15460–15469. [Google Scholar] [CrossRef]
  25. Yao, T.; Pan, Y.; Li, Y.; Mei, T. Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5263–5271. [Google Scholar] [CrossRef]
  26. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251. [Google Scholar] [CrossRef]
  27. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; So Kweon, I. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1037–1045. [Google Scholar]
  28. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  29. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In ECCV; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  30. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
