Article

A Study on Generating Maritime Image Captions Based on Transformer Dual Information Flow

1 Navigation College, Dalian Maritime University, Dalian 116026, China
2 Public Security Technology R&D Center, Liaoning Police College, Dalian 116036, China
3 Zhilong (Dalian) Marine Technology Co., Ltd., 11th Floor, No. 523 Huangpu Road, Dalian 116020, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(7), 1204; https://doi.org/10.3390/jmse13071204
Submission received: 23 May 2025 / Revised: 19 June 2025 / Accepted: 20 June 2025 / Published: 21 June 2025

Abstract

The environmental perception capability of intelligent ships is essential for enhancing maritime navigation safety and advancing shipping intelligence. Image caption generation technology plays a pivotal role in this context by converting visual information into structured semantic descriptions. However, existing general-purpose models often struggle to perform effectively in complex maritime environments due to limitations in visual feature extraction and semantic modeling. To address these challenges, this study proposes a transformer dual-stream information (TDSI) model. The proposed model uses a Swin-transformer to extract grid features and combines them with fine-grained scene semantics obtained via SegFormer. A dual-encoder structure independently encodes the grid and segmentation features, which are subsequently fused through a feature fusion module for implicit integration. A decoder with a cross-attention mechanism is then employed to generate descriptive captions for maritime images. Extensive experiments were conducted using the constructed maritime semantic segmentation and maritime image captioning datasets. The results demonstrate that the proposed TDSI model outperforms existing mainstream methods in terms of several evaluation metrics, including BLEU, METEOR, ROUGE, and CIDEr. These findings confirm the effectiveness of the TDSI model in enhancing image captioning performance in maritime environments.

1. Introduction

In recent years, the deep integration of artificial intelligence and maritime technologies has positioned smart ships as a key focus for transforming and upgrading the global shipping industry. The Code for Intelligent Ships, issued by the China Classification Society in 2025 [1], clearly outlines four core capabilities required of intelligent ships: perception, memory and reasoning, learning and self-adaptation, and behavioral decision-making. Among these, memory and reasoning serve as the foundation for system–environment interaction and directly influence the depth of understanding and accuracy of response to complex maritime scenarios. In this context, image captioning has emerged as a key technology for enhancing the reasoning capability of intelligent ships by converting visual information into structured semantic descriptions. This technology not only allows smart ships to understand and utilize image data more effectively but also opens up new possibilities for human–machine interaction. By generating image-based captions, intelligent ships can improve environmental awareness and enhance navigational safety.
Traditional lines of research, such as maritime target recognition and detection under low visibility [2], traffic flow prediction [3], and control tasks [4], each contribute to navigational safety in their own way. Image captioning, which lies at the intersection of computer vision and natural language processing, likewise holds considerable potential for safeguarding navigation. With the development of deep learning techniques, notable progress has been made in image caption generation research; however, most image caption generation models are based on English, with relatively limited studies focused on Chinese, and research on Chinese image caption generation still faces many challenges. In addition, most existing studies focus on general scenarios, while algorithm optimization and dataset development tailored to the maritime environment remain largely unexplored. The unique characteristics of the maritime environment create notable domain-adaptation issues for general image captioning models, limiting their practical effectiveness on smart ships. Currently, most existing methods follow the encoder–decoder architecture, using visual features extracted by convolutional neural networks (CNNs) as inputs and further encoding them into a visual–language space. However, these methods have a major drawback: the visual information extracted by the visual feature extractor is often insufficient or inaccurate, with critical details being overlooked. Semantic segmentation, which assigns a category label to each pixel in an image, can provide richer contextual information for image captioning algorithms. It can compensate for insufficient visual information, help avoid description errors, and enhance the accuracy and completeness of generated descriptions.
This study addresses the key challenges in maritime image captioning by proposing a dual-information-flow architecture based on the Transformer model and by constructing both a Maritime Semantic Segmentation dataset and a Maritime Image Captioning dataset. To enhance the system's ability to capture maritime scenes, this study incorporates a fusion of segmentation features and grid features. A cross-attention mechanism is introduced to optimize semantic scene modeling, thereby improving the accuracy and professionalism of the generated descriptions. The proposed approach successfully realizes image captioning for a subset of maritime traffic scenarios. The experimental results show that the proposed TDSI model is superior to other models.
The primary contributions of this study are enumerated below.
1. Addressing key challenges in maritime image captioning: we propose a dual-information-flow architecture based on the Transformer model that integrates segmentation features and grid features, and we introduce a cross-attention mechanism to optimize semantic scene modeling, thereby enhancing the accuracy and professionalism of the generated semantic descriptions.
2. We construct two maritime image datasets: the Maritime Semantic Segmentation Dataset (MSS) and the Maritime Image Caption Dataset (MIC). These datasets provide diverse real-world data for maritime image captioning and maritime scene semantic segmentation, thereby promoting the development of these fields.
3. We design a multimodal feature fusion module to achieve the implicit fusion of multimodal features. The parameters of the MHA (multi-head attention) layer and the PWFF (position-wise feed-forward) layer are shared across modalities, while the batch normalization layers remain modality-specific, thereby minimizing the increase in parameters and enhancing the fusion between features.
This study not only fills a research gap in image captioning technology within the field of navigation but also provides more explanatory and reliable support for environment perception in intelligent ships, which holds considerable theoretical value and practical importance for the development of intelligent maritime shipping.

2. Related Work

Currently, image caption generation algorithms are broadly categorized into three types: template-based methods, retrieval-based methods, and deep learning-based methods.
The template-based image caption generation method typically involves detecting targets in an image, identifying their attributes and relationships, and then inserting the detected information into predefined templates with blanks to generate descriptive text. This approach generally consists of two steps: image detection and text generation. A representative example is the model proposed by Farhadi et al. [5] in 2010, which extracts all possible elements from an image—including objects, actions, and scenes—using a target detection algorithm and then employs a conditional random field (CRF) to embed the most appropriate triplet information into a manually designed language template to generate image captions. Kulkarni et al. [6] improved upon this work with the BabyTalk model, which uses a target detection algorithm to identify as many sets of target-related information as possible and applies a CRF to fill the templates with the correct triplet values, thereby generating complete image descriptions. Although the template-based approach is simple, intuitive, and produces grammatically correct captions, it relies heavily on manually designed syntactic templates. The limited visual comprehension of machines and the fixed language templates result in captions that may not closely align with the image content. In addition, these methods often lack semantic diversity and flexibility of expression.
The retrieval-based image caption generation method generates descriptions by matching images from a database and associating them with the relevant text. The core process involves retrieving candidate images from the database using visual feature similarity, then extracting and combining the corresponding textual information to generate descriptions. Typical studies in this area include the method proposed by Kuznetsova et al. [7], which retrieves visual entity phrases by extracting key elements through target detection and matching them to corresponding text phrases. Ordonez et al. [8] improved the similarity algorithm to optimize retrieval accuracy by integrating multimodal features—such as objects and scenes. Socher et al. [9] developed a DT-RNN model that parses syntactic structures in vector space to enhance semantic alignment. Jacob et al. [10] used features extracted by the VGG network in combination with KNN [11] retrieval to generate descriptions. Although these methods can produce grammatically standardized descriptions, they suffer from several limitations: a heavy reliance on annotated corpora (requiring the maintenance of large-scale, high-quality datasets), poor adaptability to novel contexts (inability to generate new scene descriptions), and weak semantic associations (increased risk of content bias), all of which restrict their application in complex scenarios.
Oriol Vinyals et al. [12] proposed an image caption generation model (NIC) based on an encoder–decoder framework [13] during the COCO Challenge. The model employed the then most effective CNN, Inception [14], as the encoder and a recurrent NN, LSTM [15], as the decoder. This approach won the MSCOCO Challenge and brought notable public attention to the task of image caption generation for the first time. Since then, research in English-language image caption generation has progressed rapidly. Kelvin Xu et al. [16] were the first to incorporate an attention mechanism into the image caption generation task, proposing two attention models: soft attention and hard attention. By employing shallower and more specific features extracted with VGGNet [17], both models achieved superior performance compared with the NIC model.
Subsequent studies by Long Chen et al. [18] and Jiasen Lu et al. [19] further enhanced the application of attention mechanisms in generating image descriptions by proposing its use to control both the regions of the model’s focus on the image and the intensity of that attention. Siqi Liu et al. [20] were the first to apply a reinforcement learning algorithm to address the image caption generation problem. Building upon this, Steven J. Rennie et al. [21] proposed the SCST method. This method addressed two key issues: exposure bias and the discrepancy between assessment metrics and cross-entropy. Their approach has been widely validated and is recognized for greatly improving CIDEr scores while steadily enhancing the overall quality of generated descriptions.
Anderson et al. [22] were the first to apply a target detection pre-training model as the image feature extractor in image caption generation, leading to the development of the Up–Down model. Compared with traditional image classification models, object detection models provide more fine-grained and accurate feature information, greatly improving the captioning performance of the final model.
Marcella Cornia et al. [23] introduced the meshed-memory transformer model, using a modified transformer architecture as the language generator. Unlike traditional recurrent NNs, the transformer [24] operates entirely through attention mechanisms, allowing unbiased relationships between tokens and making it well suited to generative tasks. Traditional attention mechanisms typically use linear fusion to model cross-modal feature interactions, which captures only first-order interactions between different modalities. This limitation significantly reduces the effectiveness of attention mechanisms in cross-modal content reasoning tasks such as image caption generation. To overcome it, Pan et al. [25] proposed the X-Linear Attention mechanism, which utilizes bilinear pooling to capture higher-order interactions between visual and textual features, thereby enhancing output representations. Ji et al. [26] designed a global-augmented encoder to extract global image features and a global-adaptive decoder to incorporate these features effectively into caption generation. The COS-Net [27] model performs cross-modal retrieval via CLIP, obtaining semantically similar sentences from the training set as "initial semantic cues" to address the incomplete representation caused by the limited semantic categories of traditional visual encoders; however, it relies on CLIP as the semantic retriever, its structure is complex, and its inference cost is high. Unlike previous models that perform modal fusion only on the encoder or decoder side, DFT [28] introduces fusion mechanisms on both the encoder and decoder sides to achieve more complete information interaction and achieves higher scores.
In addition, Zhuang et al. [29] offer a transformer-based dense captioner (TDC), a revolutionary architecture for learning image mapping and dense captions while prioritizing informative regions. It introduces a region–object correlation score unit (ROCSU) for determining importance. Experiments demonstrate TDC’s superiority over conventional approaches. Thao et al. [30] investigated the impact of generated captions on web-scraped data points with nondescript text. The study also highlights the limitations of synthetic text and the importance of image curation with increasing training data quantity. Zhuang et al. [31] introduced an enhanced dense captioning architecture called Enhanced Transformer Dense Captioner (ETDC), which dynamically diversifies the vocabulary bank during captioning. It incorporates a Textual Context Module and a Dynamic Vocabulary Frequency Histogram re-sampling strategy, outperforming state-of-the-art methods in mean Average Precision. Zhuang et al. [32] proposed an end-to-end dense captioning system based on multi-scale transformer decoding (DCMSTRD). DCMSTRD solves dense captioning by set matching and prediction instead.
With regard to other languages, the work of Subedi and Bal [33] focuses on bridging the gap for image caption generation in the low-resource Nepali language; to achieve better results, they utilized a CNN encoder combined with a Transformer-based architecture. Solomon and Abebe [34] developed an integrated deep learning model that produces semantically meaningful Amharic-language image captions; an attention mechanism is applied to recognize significant components in the images, and the captions are produced by a bidirectional gated recurrent unit (Bi-GRU) with an attention decoder. The work by Chethas et al. [35] generates image captions in the Kannada language using various architectures and deep learning models.
There have also been studies focusing on diffusion-based modeling of image captions [36,37]; although these methods can generate diverse captions and better handle noise and uncertainty, they are difficult to implement and train and incur high computational cost and long computation time.

3. Research Methodology

Herein, we propose a Transformer-based image caption generation model, transformer dual-stream information (TDSI), which integrates semantic segmentation and visual feature fusion. The overall architecture is illustrated in Figure 1. The model first extracts grid features ($V_g$) and segmentation features ($V_s$) from the input image ($I$) using a grid feature extractor and a segmentation feature extractor, respectively. Since grid and mask visual features possess distinct visual representational properties, they are encoded separately using transformer encoders with non-shared parameters. A fusion module is then employed to implicitly combine the segmentation and grid features, and a cross-attention mechanism is introduced to further fuse the two feature streams. The process can be mathematically represented as follows, Equations (1) and (2):
$$V_s, V_g = \mathrm{Encoder}(I) \tag{1}$$
$$O = \mathrm{Transformer\_Decoder}(V_s, V_g, \omega) \tag{2}$$
where $I$ is the input image; $\omega$ is the description statement; $V_s$ and $V_g$ are the segmentation features and grid features after implicit fusion, respectively; and $O$ is an output word probability distribution over a predefined vocabulary dictionary.
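For concreteness, the following is a minimal PyTorch sketch of this dual-stream pipeline; the four sub-module names (grid_extractor, seg_extractor, dual_encoder, caption_decoder) are placeholders standing in for the Swin-transformer, SegFormer, dual encoder with fusion, and cross-attention decoder described below, not the authors' implementation.

```python
import torch.nn as nn

class TDSI(nn.Module):
    """Top-level sketch of the dual-stream pipeline in Eqs. (1) and (2)."""

    def __init__(self, grid_extractor, seg_extractor, dual_encoder, caption_decoder):
        super().__init__()
        self.grid_extractor = grid_extractor      # I -> V_g
        self.seg_extractor = seg_extractor        # I -> V_s
        self.dual_encoder = dual_encoder          # (V_s, V_g) -> implicitly fused features
        self.caption_decoder = caption_decoder    # (V_s, V_g, w) -> vocabulary logits

    def forward(self, image, tokens):
        v_g = self.grid_extractor(image)                 # grid features
        v_s = self.seg_extractor(image)                  # segmentation features
        v_s, v_g = self.dual_encoder(v_s, v_g)           # Eq. (1)
        logits = self.caption_decoder(v_s, v_g, tokens)  # Eq. (2)
        return logits.softmax(dim=-1)                    # word probability distribution O
```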

3.1. Transformer Architecture

This study employs the widely used transformer [24] model, which captures sequence dependencies entirely through an attentional mechanism, addresses the long sequence dependency problem, and enables parallel computation, significantly enhancing prediction accuracy and training speed. As shown in Figure 1, the Transformer model follows an encoding–decoding architecture, with the grid and segmentation feature encoder on the left containing L encoder blocks and the decoder on the right consisting of L decoder blocks. Both components include essential elements such as multi-head attention, add and norm, and feed-forward network.
In the encoding part, because there is no fixed order relationship between different regions of the image, the image feature encoding sequence models the relationship between features through the Transformer full attention mechanism. In the decoding part, we introduce the positional encoding operation because the input text is arranged in a certain order. This allows the model to exploit the sequential information effectively. To achieve this, we include information about the relative or absolute position of the characters in the sequence, Equations (3) and (4).
$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{\mathrm{model}}}\right) \tag{3}$$
$$PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\mathrm{model}}}\right) \tag{4}$$
where $pos$ is the position and $i$ is the dimension. Each dimension of the positional encoding corresponds to a sinusoid, and the wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We chose this function because we assumed it would allow the model to learn relative positions more easily, since for any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$.
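As an illustration, the short helper below builds the encoding table defined by Equations (3) and (4); the looping over even and odd dimensions follows the standard sinusoidal scheme and assumes an even model dimension.

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Build the (max_len, d_model) table defined by Eqs. (3) and (4)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    dim = torch.arange(0, d_model, 2, dtype=torch.float32)          # even dimensions 2i
    angle = pos / torch.pow(10000.0, dim / d_model)                 # pos / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(angle)   # PE(pos, 2i+1)
    return pe
```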

3.1.1. Multi-Head Attention Mechanism

The multi-head attention mechanism projects the query matrix ($Q$), key matrix ($K$), and value matrix ($V$) into several lower-dimensional subspaces. Attention is computed independently in each subspace, and the resulting heads are concatenated and linearly transformed. The calculation can be expressed as follows, Equation (5):
$$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}\!\left(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}\right) \tag{5}$$
where the mappings are the parameter matrices:
$$W_i^{Q} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}, \quad W_i^{K} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}, \quad W_i^{V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}, \quad W^{O} \in \mathbb{R}^{h d_v \times d_{\mathrm{model}}}$$
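A compact PyTorch sketch of Equation (5) is given below; the default dimensions (d_model = 512, 8 heads) match the configuration reported in Section 4.2, while the code itself is the standard scaled dot-product formulation rather than the authors' exact implementation.

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Standard scaled dot-product multi-head attention following Eq. (5)."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.w_q = nn.Linear(d_model, d_model)   # stacks all W_i^Q
        self.w_k = nn.Linear(d_model, d_model)   # stacks all W_i^K
        self.w_v = nn.Linear(d_model, d_model)   # stacks all W_i^V
        self.w_o = nn.Linear(d_model, d_model)   # W^O applied after concatenation

    def forward(self, q, k, v):
        b = q.size(0)

        def split(x, proj):
            # (batch, seq, d_model) -> (batch, heads, seq, d_k)
            return proj(x).view(b, -1, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        scores = (q @ k.transpose(-2, -1)) / self.d_k ** 0.5
        heads = scores.softmax(dim=-1) @ v
        heads = heads.transpose(1, 2).contiguous().view(b, -1, self.num_heads * self.d_k)
        return self.w_o(heads)                   # Concat(head_1, ..., head_h) W^O
```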

3.1.2. Position-Wise Feed-Forward Networks

Following the attention operation, each encoder and decoder layer includes a fully connected position-wise feed-forward network, which applies the same transformation to each position vector. This network consists of two linear transformations with a ReLU activation function in between. Its primary role is to provide nonlinear transformations that improve the model's ability to learn complex mappings, Equation (6).
$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2 \tag{6}$$
Each layer uses a distinct set of parameters.
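A minimal sketch of this sublayer follows; the hidden width d_ff = 2048 is an assumption of the sketch (the paper does not report it) chosen to match the common transformer default.

```python
import torch.nn as nn

class PositionWiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position (Eq. (6))."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # x W1 + b1
            nn.ReLU(),                  # max(0, .)
            nn.Linear(d_ff, d_model),   # . W2 + b2
        )

    def forward(self, x):
        return self.net(x)
```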

3.2. Grid Feature Extraction Network

In recent years, some Transformer-based variant models, particularly the Swin-transformer [38], have demonstrated improved performance in grid visual feature extraction and have been successfully applied to image captioning tasks. Herein, we also adopt the pre-trained Swin-transformer as the grid feature extractor. The input image is first partitioned into nonoverlapping patches by a patch-partition module. Each patch is then linearly embedded into a low-dimensional space, and patch embeddings are produced across four downsampling stages, corresponding to downsampling factors of 4, 8, 16, and 32. Except for the first stage, which employs a linear embedding layer, the remaining three stages are processed with multilayer Swin-transformer blocks. These blocks employ window-based multi-head self-attention and shifted-window multi-head self-attention to extract the final grid features, which are computed as follows, Equation (7):
$$V_g = \mathrm{Swin\_Transformer}(I) \tag{7}$$
where $V_g \in \mathbb{R}^{h \times w \times d}$; $h$ and $w$ denote the height and width of the feature map, respectively, and $d$ is the feature dimension.
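In practice, a pre-trained Swin backbone can be used off the shelf as the grid feature extractor. The sketch below uses the timm library; the specific checkpoint name and the features_only interface are assumptions of this sketch, not the configuration used in this study.

```python
import timm
import torch

# A pre-trained Swin backbone used as a grid feature extractor (Eq. (7)).
# The final-stage feature layout may be (B, H, W, C) or (B, C, H, W)
# depending on the timm version.
swin = timm.create_model(
    "swin_base_patch4_window7_224", pretrained=True, features_only=True
)
swin.eval()

image = torch.randn(1, 3, 224, 224)   # placeholder for the input image I
with torch.no_grad():
    v_g = swin(image)[-1]             # final-stage grid features (downsampled x32)
print(v_g.shape)
```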

3.3. Segmentation Feature Extraction Network

The segmentation feature extraction network uses SegFormer [39] to extract key semantic features, which not only mitigates the visual confusion problem but also compensates for the limited representational capacity of grid features. SegFormer is an advanced transformer framework for semantic segmentation that combines a transformer encoder with a lightweight multilayer perceptron decoder while balancing efficiency, accuracy, and robustness. It features a hierarchical transformer encoder without positional encoding and a lightweight all-connected multilayer perceptron (All-MLP) decoder that generates robust representations without complex, computationally demanding modules.
The SegFormer framework comprises two primary modules: a hierarchical transformer encoder for extracting both coarse and fine-grained features and a lightweight All-MLP decoder for directly fusing these multilevel features to predict semantic segmentation masks ( V s ), which are processed as follows, Equation (8):
$$V_s = \mathrm{SegFormer}(I) \tag{8}$$
where $V_s \in \mathbb{R}^{c \times h \times w}$; $h$ and $w$ denote the height and width of the feature map, respectively, and $c$ denotes the number of categories.
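A pre-trained SegFormer can likewise be loaded as the segmentation branch. The sketch below uses the Hugging Face transformers interface; the public ADE20K checkpoint is a stand-in for the SegFormer model fine-tuned on the MSS dataset in this study.

```python
import numpy as np
import torch
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

# SegFormer as the segmentation-feature extractor (Eq. (8)).
name = "nvidia/segformer-b0-finetuned-ade-512-512"
processor = SegformerImageProcessor.from_pretrained(name)
model = SegformerForSemanticSegmentation.from_pretrained(name)
model.eval()

image = np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8)  # placeholder image I
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    v_s = model(**inputs).logits     # (1, c, h/4, w/4) per-class mask logits
print(v_s.shape)
```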

3.4. Encoder Module

3.4.1. Encoding Network

The encoder module comprises the transformer layer and the specially designed feature fusion module. The transformer layer encodes the extracted segmentation features and grid features for subsequent integration in the feature fusion module, thereby facilitating their implicit fusion. The structure of this module is illustrated in Figure 2.
The process can be expressed as follows, Equations (9) and (10):
$$V_g = \mathrm{Transformer\_Layer}(V_g) \tag{9}$$
$$V_s = \mathrm{Transformer\_Layer}(V_s) \tag{10}$$

3.4.2. Fusion Module

Ref. [40] and others have confirmed that multimodal features can be learned within a single shared network by retaining only modality-specific batch normalization layers in the encoder. Furthermore, using shared parameters promotes the implicit fusion of multimodal features through internal interaction, resulting in performance improvements. Consequently, we share the parameters of the MHA (multi-head attention mechanism) layer and the PWFF (position-wise feed-forward networks) layer while maintaining the modal specificity of the batch normalization layer to minimize the increase in parameters and enhance the fusion of the two, as illustrated in Figure 3. The process can be expressed as follows, Equation (11):
$$V_s, V_g = \mathrm{Fusion\_Layer}(V_s, V_g) \tag{11}$$
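The sketch below illustrates the parameter-sharing idea: one attention block and one feed-forward block serve both modalities, while each modality keeps its own normalization layers. The hidden width and the use of LayerNorm in place of the paper's batch normalization are assumptions of this sketch, not the authors' implementation.

```python
import torch.nn as nn

class FusionLayer(nn.Module):
    """Fusion-layer sketch for Eq. (11): shared attention and feed-forward weights,
    modality-specific normalization."""

    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)   # shared MHA
        self.pwff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                  nn.Linear(d_ff, d_model))                       # shared PWFF
        self.norm_s = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(2)])    # seg-specific
        self.norm_g = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(2)])    # grid-specific

    def _branch(self, x, norms):
        # residual attention and feed-forward, normalized per modality
        x = norms[0](x + self.mha(x, x, x, need_weights=False)[0])
        return norms[1](x + self.pwff(x))

    def forward(self, v_s, v_g):
        return self._branch(v_s, self.norm_s), self._branch(v_g, self.norm_g)
```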

3.5. Decoder Module

3.5.1. Cross-Attention Mechanism

In ref. [41], three methods were designed to combine two visual features. The simplest involves concatenating the two visual features and using the result as the key and value in a standard multi-head attention sublayer, with the word embedding serving as the query; however, this method is prone to information confusion. Another method uses sequential cross-attention, placing two independent multi-head attention sublayers one after the other, one for grid features and the other for region features (or vice versa), but this introduces a sequence bias. The third method applies multi-head attention to the two visual features in parallel, using two multi-head attention mechanisms with independent learnable parameters. As shown in Figure 4, this yields the best results. Therefore, we incorporate this cross-attention mechanism into the decoder module to fuse grid features and segmentation features. The decoder consists of $L_d$ transformer layers, each of which adds a third sublayer between the MHA and the FFN, taking the outputs of the encoder and of the MHA as inputs. Specifically, the decoder is guided by the semantic features $S_l$, while the cross-attention module is employed to enhance the fused segmentation and grid features.
Given the actual sentence $x$, it is first segmented and mapped to a vocabulary list using word embeddings. Then, the masked multi-head attention sublayer is applied to extract high-level semantic features: $S_l = \mathrm{Masked\_MHA}(x, x, x)$. After obtaining the output of the encoder, the decoder with the cross-attention module is used to generate the description, as illustrated in Figure 4.
The cross-attention mechanism is handled as follows, Equations (12)–(14):
$$c_t^{g} = \mathrm{Sigmoid}\!\left(W_g\,[a_t^{g};\, S_l] + b_g\right) \tag{12}$$
$$c_t^{s} = \mathrm{Sigmoid}\!\left(W_s\,[a_t^{s};\, S_l] + b_s\right) \tag{13}$$
$$A_l = \mathrm{LN}\!\left(c_t^{g} \otimes a_t^{g} + c_t^{s} \otimes a_t^{s} + S_l\right) \tag{14}$$
where $a_t^{s}$ and $a_t^{g}$ are attention features, $c_t^{s}$ and $c_t^{g}$ are normalized probabilities, and $\otimes$ is the dot product operation.
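A PyTorch sketch of this gated cross-attention, following Equations (12)–(14), is given below. Treating the gating operation as an element-wise product and implementing the gate projections as single linear layers over the concatenated inputs are interpretation choices of this sketch.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Gated cross-attention over grid and segmentation features, Eqs. (12)-(14)."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn_g = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.attn_s = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.gate_g = nn.Linear(2 * d_model, d_model)   # W_g, b_g over [a_t^g; S_l]
        self.gate_s = nn.Linear(2 * d_model, d_model)   # W_s, b_s over [a_t^s; S_l]
        self.norm = nn.LayerNorm(d_model)

    def forward(self, s_l, v_g, v_s):
        a_g, _ = self.attn_g(s_l, v_g, v_g)             # words attend to grid features
        a_s, _ = self.attn_s(s_l, v_s, v_s)             # words attend to seg features
        c_g = torch.sigmoid(self.gate_g(torch.cat([a_g, s_l], dim=-1)))  # Eq. (12)
        c_s = torch.sigmoid(self.gate_s(torch.cat([a_s, s_l], dim=-1)))  # Eq. (13)
        return self.norm(c_g * a_g + c_s * a_s + s_l)   # Eq. (14): A_l
```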

3.5.2. Decoding Networks

As shown on the right side of Figure 1, the decoding network has three inputs: the word-embedded descriptive text, the grid visual features, and the segmentation visual features. The word-embedded features are processed by the masked self-attention layer to obtain the corresponding semantic features. Then, the text semantic features are combined with the grid visual features and the segmentation visual features through the cross-attention mechanism. The process is as follows, Equation (15):
$$A_l = \mathrm{Cross\_Attention}(S_l, V_g, V_s) \tag{15}$$
where $A_l$ contains the integrated visual features that are adaptively generated based on the semantic features provided by the linguistic decoder. Subsequently, the interaction between visual and semantic features is further explored by non-linearly mapping and transforming the input features through a combination of two fully connected layers and activation functions, Equation (16).
$$H_l = \mathrm{LN}\!\left(A_l + \mathrm{PWFF}(A_l)\right) \tag{16}$$
Finally, the final word probability distribution $O$ is obtained by feeding the output $H_l$ of the last decoder layer into a linear layer followed by a softmax function.
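Putting the pieces together, one decoder layer can be sketched as follows, reusing the CrossAttention module from the previous sketch; the residual-plus-LayerNorm arrangement follows Equation (16) and is an interpretation of this sketch rather than the authors' exact code.

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, gated cross-attention, and a
    position-wise feed-forward block (Eqs. (15)-(16))."""

    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.masked_mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross = CrossAttention(d_model, num_heads)   # module defined in the sketch above
        self.pwff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                  nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, v_g, v_s, causal_mask):
        # S_l: high-level semantic features from masked self-attention
        s_l = self.norm1(x + self.masked_mha(x, x, x, attn_mask=causal_mask,
                                             need_weights=False)[0])
        a_l = self.cross(s_l, v_g, v_s)                   # Eq. (15)
        return self.norm2(a_l + self.pwff(a_l))           # Eq. (16): H_l
```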

4. Experimental Results and Analyses

4.1. Dataset

Currently, there is a scarcity of resources for semantic segmentation and image description in maritime environments, and existing datasets for model training in this domain are limited. Moreover, the manual annotation of such datasets is tedious and labor-intensive. Although Chinese is one of the most widely spoken languages globally, image description datasets in Chinese are even more scarce. Unlike English, Chinese words lack segmentation characters, which increases the complexity of image description tasks in this language.
Herein, we propose the construction of two datasets for maritime imagery: the Maritime Semantic Segmentation dataset (MSS) and the Maritime Image Caption dataset (MIC). We obtained videos of real sailing scenes by installing shipboard cameras on ships sailing in the Yantai and Dalian port areas. We then manually selected the original videos with higher clarity and less jitter and used image processing techniques such as frame extraction and cropping to obtain the original images from the selected videos. The MSS dataset comprises seven categories and includes 2405 images of real-world voyages, as shown in Figure 5. The MIC dataset comprises 890 images, each accompanied by five Chinese descriptive statements, as shown in Figure 6. We used a randomized splitting method to divide these two datasets into training, validation, and testing sets in a ratio of 8:1:1. The SegFormer model was trained on the MSS dataset, and the resulting model files were subsequently used for extracting segmentation features.
The construction of these two datasets is of crucial importance to this study, as they provide valuable data resources for training CNN models. They enhance the model’s ability to accurately understand and learn the features and descriptive requirements of maritime images. In addition, these datasets are expected to serve as benchmark data for future research in maritime image description and to further support developments in this field.

4.2. Experimental Environment and Evaluation Indicators

The experiments were conducted on an Ubuntu 24.04.1 64-bit system using the PyTorch deep learning framework for training and testing. The hardware setup included a Hygon C86 3350 8-core processor and an Nvidia RTX 4090 GPU with 24 GB of video memory. The Chinese descriptive statements in the MIC dataset were segmented using the jieba toolkit, resulting in a vocabulary of 97 words. The model architecture was configured with three encoder layers (two of which are feature fusion layers) and three decoder layers. We set the dimensionality d of each layer to 512, the number of heads to 8, and the number of memory vectors to 40, and employed dropout with a keep probability of 0.9 after each attention and feed-forward layer. During training, we used the Adam optimizer with a warmup learning rate schedule, with the number of warmup steps set to 1000. Subsequently, the SCST self-critical reinforcement learning strategy was used to continue training the model for 30 further iterations, with the learning rate set to 5 × 10−6 in accordance with previous work [23]. In the model testing phase, we adopted a beam search strategy with the beam width set to 5. The model was first pre-trained with the cross-entropy loss, Equation (17):
$$L_{XE}(\theta) = -\sum_{t=1}^{T} \log\!\left(p_\theta\!\left(y_t \mid y_{1:t-1}\right)\right) \tag{17}$$
Then, following previous work [22], the CIDEr score was adopted as the reward signal, as it demonstrates a strong correlation with human judgment [42]. The resulting policy gradient, based on the CIDEr reward, is expressed as follows, Equation (18):
$$\nabla_\theta L(\theta) = -\frac{1}{k} \sum_{i=1}^{k} \left(r(\omega^{i}) - b\right) \nabla_\theta \log p_\theta(\omega^{i}) \tag{18}$$
where $i$ is the sentence number, $r(\cdot)$ is the reward function, and $b = \left(\sum_{i} r(\omega^{i})\right)/k$ is the baseline, computed as the mean of the rewards obtained by the sampled sequences.
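The following sketch shows how one SCST update following Equation (18) could be written; model.sample, tokenizer.decode, and cider_reward are assumed interfaces, not the authors' code, and the caller is responsible for the optimizer step.

```python
import torch

def scst_step(model, image, tokenizer, cider_reward, k: int = 5):
    """One SCST update following Eq. (18)."""
    # sample k captions w^i with their per-token log-probabilities log p(w^i)
    samples, log_probs = model.sample(image, num_samples=k)   # log_probs: (k, T)
    rewards = torch.tensor(
        [cider_reward(tokenizer.decode(s)) for s in samples]  # r(w^i), CIDEr-based
    )
    baseline = rewards.mean()                                 # b = (sum_i r(w^i)) / k
    loss = -((rewards - baseline) * log_probs.sum(dim=-1)).mean()  # Eq. (18)
    loss.backward()
    return loss.item()
```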
In order to make a reasonable assessment of the effectiveness and sophistication of the algorithmic models in this paper, the experiments were conducted using four objective quantitative scoring methods that are widely used in image captioning: bilingual evaluation understudy 4-gram (BLEU-4) [43], consensus-based image description evaluation (CIDEr) [42], metric for evaluation of translation with explicit ordering (METEOR) [44], and recall-oriented understudy for gisting evaluation longest common subsequence (ROUGE) [45].

4.3. Comparison and Analysis of Experimental Results

4.3.1. Quantitative Analysis

To verify the validity of the proposed image captioning model, we designed seven comparison models; all experimental results are shown in Table 1. (1) GoogleNIC: Google's NIC model. (2) M2: the meshed-memory transformer model. (3) DFT: the deep fusion transformer image captioning model. (4) Transformer: the standard Transformer model. (5) TransformerS: a Transformer image captioning model that fuses dual information streams based on a cross-attention mechanism. (6) TransformerS1: compared to our model, TransformerS1 uses a single feature fusion layer in the encoder. (7) TransformerS3: compared to our model, TransformerS3 uses feature fusion layers throughout the encoder. All of the above models were retrained on the MIC dataset.
With the same number of transformer layers and identical training settings, ablation experiments were conducted on the TDSI model, as shown in Table 1 and Table 2. In the comparison between the Transformer and TransformerS models, it is evident that the addition of segmentation features leads to performance improvements across all metrics of the TransformerS model—BLEU, METEOR, ROUGE, and CIDEr. Among these, the CIDEr score improves by 0.128, indicating that the semantic information embedded in segmentation features can effectively compensate for the limitations of visual information, which is consistent with the findings reported in the literature [41].
Second, a comparison between the TDSI model and the TransformerS model reveals that, except for the BLEU-4 metric, all other evaluation metrics show varying degrees of improvement in the TDSI model; notably, the CIDEr score increases by 0.081. In the experiments with TransformerS1 and TransformerS3, our model improves the CIDEr metric by 0.020 and 0.057, respectively, and the other metrics, except METEOR, also improve to varying degrees, which suggests that the proposed multimodal fusion approach can effectively promote interaction between features. When combined with the cross-attention mechanism, this approach allows the TDSI model to sufficiently fuse segmentation and grid features, significantly enhancing performance.
Furthermore, we explored the effect of the number of fusion layers on model performance, as shown in Table 2. It can be observed that, as the number of feature fusion layers increases, the number of parameters decreases gradually. With two feature fusion layers, our model has 30.41M parameters and achieves the best CIDEr score. As the number of fusion layers increases further, the parameter count continues to decrease, but the metric scores also begin to decline. These results confirm that the integration of the feature fusion module and the cross-attention mechanism achieves superior performance.
Finally, to estimate the model results more accurately, we used k-fold cross-validation to prevent data-split bias. We randomly divided the dataset into five parts; one part was randomly split in half to serve as the validation set and test set, respectively, while the remaining four parts were used as the training set for retraining. The cross-validation was repeated five times, and the results are shown in Table 3. The "Mean" row gives the average of each metric, and the "SD" row gives the standard deviation. The standard deviations are low overall, and the relative standard deviation (the ratio of standard deviation to mean) of every metric is below 2.5%, indicating that the model is stable and consistent across different data splits. This also shows that the experimental results in Table 1 fall within the expected range.
In the comparison test, we evaluated the proposed algorithm against Google neural image caption (GoogleNIC) [12], the M2 transformer [23], and DFT [28], as shown in Table 4. The CIDEr score of the proposed algorithm reaches 3.343, BLEU-4 reaches 0.788, METEOR reaches 0.575, and ROUGE reaches 0.897. In terms of the CIDEr metric, the TDSI algorithm scores 0.029 higher than DFT, 0.031 higher than M2, and 1.265 higher than NIC. Under the same dataset and training conditions, the proposed algorithm achieves the highest scores across all evaluation metrics. These results demonstrate that the proposed TDSI model is effective in using segmentation features to supplement visual information for generating image descriptions. Furthermore, the feature fusion module employed in this study can efficiently perform feature integration and enhance the overall performance of the model.

4.3.2. Qualitative Analysis

Table 5 shows some image description results of the Transformer and TDSI models on the MIC dataset, including both successful and failed cases. For the first image, the TDSI model's description is clearly closer to the ground truth than the Transformer model's. For the second image, although both descriptions contain errors, the TDSI model is able to describe the "buoys," whereas the Transformer model's description is less relevant. For the third image, the TDSI model generated a correct description, while the Transformer model generated an incorrect one. These partial results show that the relevance and accuracy of the descriptions generated by the TDSI model remain higher than those of the Transformer model; it is worth noting, however, that the accuracy of TDSI's descriptions degrades when certain objects occupy only a very small fraction of the image. These findings confirm that the TDSI model improves the accuracy and completeness of the descriptions it generates. Overall, compared with the standard Transformer model, the TDSI model generates more accurate and descriptive sentences by integrating segmentation features and grid features and enhancing the visual representation available to the decoder.

5. Conclusions

To address the limitations of visual area features in existing image captioning models and the lack of related research in the maritime domain, this study proposes a TDSI image captioning model that incorporates a semantic segmentation network. The proposed method integrates grid features with segmentation features, treating the latter as a secondary source of visual information. The proposed method enriches the extracted visual representations and enhances the overall performance of caption generation. Qualitative and quantitative experiments conducted on the MIC dataset demonstrate that the proposed method improves model performance and offers strong interpretability during the caption generation process. This study contributes to closing the research gap in maritime image captioning and provides more interpretable environmental cognition for intelligent ship perception systems. These contributions are of both theoretical and practical significance, particularly for enhancing maritime navigation safety and advancing the intelligent development of shipping technologies. However, similar to most data-driven image captioning models, the performance of the proposed method is influenced by the size and diversity of the training dataset. Although this study introduces a new dataset for maritime image captioning, its limited size and generalizability pose challenges. Future work should focus on expanding the dataset in terms of scale and scene diversity, including images from different regions, time of day, and weather conditions. Such efforts will enable the model to better adapt to a wider range of real-world maritime scenarios and support the practical deployment of image captioning in traffic-related applications.

Author Contributions

Z.Z., conceptualization, methodology, data curation, validation, writing—original draft, writing—review and editing; H.S., investigation, project administration, data curation, methodology, writing—review and editing; M.W., investigation, data curation, supervision; Y.W., conceptualization, supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This study is supported by the 2023 Liaoning Provincial Science and Technology Plan (Key) Project (No. 2023JH1/10400047) by the Liaoning Provincial Department of science and technology, the 2023 Dalian Science and Technology Talent Innovation Support Policy Project (No. 2023RY005) by the Dalian Bureau of Science and Technology, and Guangxi Key Research and Development Plan (Grant No. GUIKE AA23062052-03).

Data Availability Statement

The dataset is available at: https://github.com/0xzzq666/MIC-and-MSS (accessed on 19 June 2025).

Conflicts of Interest

Author Yufei Wang was employed by the company Zhilong (Dalian) Marine Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. China Classification Society. Rules for Intelligent Ships. China Classification Society. 2025. Available online: https://www.ccs.org.cn/ccswzen/specialDetail?id=202503310327204112 (accessed on 19 June 2025).
  2. Chen, X.; Wei, C.; Xin, Z.; Zhao, J.; Xian, J. Ship detection under low-visibility weather interference via an ensemble generative adversarial network. J. Mar. Sci. Eng. 2023, 11, 2065. [Google Scholar] [CrossRef]
  3. Chen, X.; Wu, S.; Shi, C.; Huang, Y.; Yang, Y.; Ke, R.; Zhao, J. Sensing data supported traffic flow prediction via denoising schemes and ANN: A comparison. IEEE Sens. J. 2020, 20, 14317–14328. [Google Scholar] [CrossRef]
  4. Liu, X.; Qiu, L.; Fang, Y.; Wang, K.; Li, Y.; Rodríguez, J. Event-Driven Based Reinforcement Learning Predictive Controller Design for Three-Phase NPC Converters Using Online Approximators. IEEE Trans. Power Electron. 2024, 40, 4914–4926. [Google Scholar] [CrossRef]
  5. Farhadi, A.; Hejrati, M.; Sadeghi, M.A.; Young, P.; Rashtchian, C.; Hockenmaier, J.; Forsyth, D. Every picture tells a story: Generating sentences from images. In Proceedings of the Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; Proceedings, Part IV 11; Springer: Berlin/Heidelberg, Germany, 2010; pp. 15–29. [Google Scholar]
  6. Kulkarni, G.; Premraj, V.; Ordonez, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A.C.; Berg, T.L. Babytalk: Understanding and generating simple image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2891–2903. [Google Scholar] [CrossRef]
  7. Kuznetsova, P.; Ordonez, V.; Berg, A.; Berg, T.; Choi, Y. Collective generation of natural image descriptions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju, Korea, 8–14 July 2012; pp. 359–368. [Google Scholar]
  8. Ordonez, V.; Kulkarni, G.; Berg, T. Im2text: Describing images using 1 million captioned photographs. Adv. Neural Inf. Process. Syst. 2011, 24. [Google Scholar]
  9. Socher, R.; Karpathy, A.; Le, Q.V.; Manning, C.D.; Ng, A.Y. Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Linguist. 2014, 2, 207–218. [Google Scholar] [CrossRef]
  10. Devlin, J.; Cheng, H.; Fang, H.; Gupta, S.; Deng, L.; He, X.; Zweig, G.; Mitchell, M. Language models for image captioning: The quirks and what works. arXiv 2015, arXiv:1505.01809. [Google Scholar]
  11. Altman, N.S. An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 1992, 46, 175–185. [Google Scholar]
  12. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 652–663. [Google Scholar] [CrossRef]
  13. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  14. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  15. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  16. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 2048–2057. [Google Scholar]
  17. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  18. Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5659–5667. [Google Scholar]
  19. Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 375–383. [Google Scholar]
  20. Liu, S.; Zhu, Z.; Ye, N.; Guadarrama, S.; Murphy, K. Improved image captioning via policy gradient optimization of spider. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 873–881. [Google Scholar]
  21. Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024. [Google Scholar]
  22. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086. [Google Scholar]
  23. Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10578–10587. [Google Scholar]
  24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  25. Pan, Y.; Yao, T.; Li, Y.; Mei, T. X-linear attention networks for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10971–10980. [Google Scholar]
  26. Ji, J.; Luo, Y.; Sun, X.; Chen, F.; Luo, G.; Wu, Y.; Gao, Y.; Ji, R. Improving image captioning by leveraging intra-and inter-layer global representation in transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; Volume 35, pp. 1655–1663. [Google Scholar] [CrossRef]
  27. Li, Y.; Pan, Y.; Yao, T.; Mei, T. Comprehending and ordering semantics for image captioning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17990–17999. [Google Scholar]
  28. Zhang, J.; Xie, Y.; Ding, W.; Wang, Z. Cross on cross attention: Deep fusion transformer for image captioning. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4257–4268. [Google Scholar] [CrossRef]
  29. Shao, Z.; Han, J.; Marnerides, D.; Debattista, K. Region-object relation-aware dense captioning via transformer. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 4184–4195. [Google Scholar] [CrossRef]
  30. Nguyen, T.; Gadre, S.Y.; Ilharco, G.; Oh, S.; Schmidt, L. Improving multimodal datasets with image captioning. Adv. Neural Inf. Process. Syst. 2023, 36, 22047–22069. [Google Scholar]
  31. Shao, Z.; Han, J.; Debattista, K.; Pang, Y. Textual context-aware dense captioning with diverse words. IEEE Trans. Multimed. 2023, 25, 8753–8766. [Google Scholar] [CrossRef]
  32. Shao, Z.; Han, J.; Debattista, K.; Pang, Y. DCMSTRD: End-to-end dense captioning via multi-scale transformer decoding. IEEE Trans. Multimed. 2024, 26, 7581–7593. [Google Scholar] [CrossRef]
  33. Subedi, B.; Bal, B.K. CNN-transformer based encoder-decoder model for Nepali image captioning. In Proceedings of the 19th International Conference on Natural Language Processing (ICON), New Delhi, India, 15–18 December 2022; pp. 86–91. [Google Scholar]
  34. Solomon, R.; Abebe, M. Amharic language image captions generation using hybridized attention-based deep neural networks. Appl. Comput. Intell. Soft Comput. 2023, 2023, 9397325. [Google Scholar] [CrossRef]
  35. Chethas, K.; Ankita, V.; Apoorva, B.; Sushma, H.; Jayashree, R. Image Caption Generation in Kannada using Deep Learning Frameworks. In Proceedings of the 2023 International Conference on Advances in Electronics, Communication, Computing and Intelligent Information Systems (ICAECIS), Bangalore, India, 19–21 April 2023; pp. 486–491. [Google Scholar]
  36. Jiang, D.; Song, G.; Wu, X.; Zhang, R.; Shen, D.; Zong, Z.; Liu, Y.; Li, H. Comat: Aligning text-to-image diffusion model with image-to-text concept matching. Adv. Neural Inf. Process. Syst. 2024, 37, 76177–76209. [Google Scholar]
  37. Daneshfar, F.; Bartani, A.; Lotfi, P. Image captioning by diffusion models: A survey. Eng. Appl. Artif. Intell. 2024, 138, 109288. [Google Scholar] [CrossRef]
  38. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  39. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  40. Wang, Y.; Sun, F.; Lu, M.; Yao, A. Learning deep multimodal feature representation with asymmetric multi-layer fusion. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 3902–3910. [Google Scholar]
  41. Nguyen, V.Q.; Suganuma, M.; Okatani, T. Grit: Faster and better image captioning transformer using dual visual features. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 167–184. [Google Scholar]
  42. Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
  43. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  44. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
  45. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
Figure 1. Structure of the TDSI model.
Figure 2. Transformer layer.
Figure 3. Fusion module.
Figure 4. Cross-attention mechanism.
Figure 5. MSS dataset.
Figure 6. MIC dataset.
Table 1. Results of ablation experiments.

Model          BLEU-1   BLEU-2   BLEU-3   BLEU-4   METEOR   ROUGE    CIDEr
Transformer    0.888    0.839    0.794    0.748    0.548    0.879    3.134
TransformerS   0.886    0.842    0.805    0.766    0.558    0.887    3.262
TransformerS1  0.884    0.835    0.794    0.757    0.591    0.896    3.323
TransformerS3  0.893    0.852    0.813    0.777    0.593    0.896    3.286
Ours           0.895    0.854    0.821    0.788    0.575    0.897    3.343
Table 2. Computational complexity.

Model          Seg Feature   Fusion Layers   FLOPs    Params    CIDEr
Transformer    No            0               0.84G    22.51M    3.134
TransformerS   Yes           0               1.50G    36.71M    3.262
TransformerS1  Yes           1               1.50G    33.56M    3.323
TransformerS3  Yes           3               1.50G    27.25M    3.286
Ours           Yes           2               1.50G    30.41M    3.343
Table 3. Cross-validation results.

Group   BLEU-1   BLEU-2   BLEU-3   BLEU-4   METEOR   ROUGE    CIDEr
1       0.916    0.877    0.845    0.809    0.597    0.908    3.410
2       0.887    0.842    0.800    0.762    0.588    0.901    3.287
3       0.917    0.873    0.838    0.800    0.578    0.899    3.374
4       0.900    0.858    0.820    0.777    0.584    0.901    3.325
5       0.916    0.873    0.830    0.786    0.598    0.919    3.407
Mean    0.907    0.864    0.827    0.787    0.589    0.905    3.361
SD      0.0133   0.0146   0.0175   0.0186   0.0085   0.0082   0.0535
Table 4. Comparative experimental results.

Model       BLEU-1   BLEU-2   BLEU-3   BLEU-4   METEOR   ROUGE    CIDEr
GoogleNIC   0.725    0.638    0.568    0.503    0.456    0.764    2.078
M2          0.877    0.838    0.803    0.765    0.542    0.882    3.312
DFT         0.882    0.843    0.808    0.770    0.546    0.882    3.314
Ours        0.895    0.854    0.821    0.788    0.575    0.897    3.343
Table 5. Qualitative comparison of generated captions.

Image 1 (Jmse 13 01204 i001)
TDSI: The sea ahead is very wide, with a city in the distance.
Transformer: The sea ahead is very wide, with islands in the distance.
GT1: The sea ahead is very wide, with a city and a dock in the distance.
GT2: The sea ahead is particularly wide, with a city and a dock in the distance.
GT3: The sea ahead is particularly wide, with a city and a port in the distance.

Image 2 (Jmse 13 01204 i002)
TDSI: The sea ahead has buoys, with the city in the distance.
Transformer: The port ahead has several ships docked along the shore.
GT1: The sea ahead has two buoys, with the city in the distance.
GT2: The sea ahead has two buoys floating, with the city's coastline in the distance.
GT3: There are two buoys on the sea ahead, and in the distance is the city's coastline.

Image 3 (Jmse 13 01204 i003)
TDSI: There are multiple ships and buoys on the sea ahead.
Transformer: The city's port is ahead.
GT1: There are multiple ships and buoys on the sea ahead.
GT2: There are ships and buoys on the sea ahead.
GT3: There are multiple ships and buoys ahead.