An Improved SAR Ship Classification Method Using Text-to-Image Generation-Based Data Augmentation and Squeeze and Excitation

Abstract: Synthetic aperture radar (SAR) plays a crucial role in maritime surveillance due to its capability for all-weather, all-day operation. However, SAR ship recognition faces challenges, primarily due to the imbalance and inadequacy of ship samples in publicly available datasets, along with the presence of numerous outliers. To address these issues, this paper proposes a SAR ship classification method that uses text-generated images to tackle dataset imbalance. Firstly, an image generation module is introduced to augment SAR ship data. This module generates images from textual descriptions to overcome the problem of insufficient samples and the imbalance between ship categories. Secondly, given the limited information content in the black background of SAR ship images, the Tokens-to-Token Vision Transformer (T2T-ViT) is employed as the backbone network. This approach effectively combines local information on the basis of global modeling, facilitating the extraction of features from SAR images. Finally, a Squeeze-and-Excitation (SE) module is incorporated into the backbone network to enhance the network’s focus on essential features, thereby improving the model’s generalization ability. To assess the model’s effectiveness, extensive experiments were conducted on the OpenSARShip2.0 and FUSAR-Ship datasets. The performance evaluation results indicate that the proposed method achieves higher classification accuracy in the context of imbalanced datasets compared to eight existing methods.


Introduction
Synthetic aperture radar (SAR) is a radar technology that utilizes microwave signals to produce images of objects on the Earth's surface [1]. By installing radar equipment on platforms, such as aircraft or satellites, and leveraging the motion of the platform along with the radar's transmit/receive capabilities, SAR technology can synthesize and process a series of radar echo signals to obtain information on surface reflectivity and high-resolution terrain images. Datasets generated through SAR imaging consist of high-resolution radar images. These images reveal fine features and provide information on the position, shape, size, and orientation of surface objects. In contrast to optical and hyperspectral imaging, SAR imaging operates continuously under all weather conditions, is not affected by environmental factors, and exhibits strong sensitivity to the geometric and physical properties of targets. Although SAR images differ significantly from the objects as perceived by the human eye, machine learning (ML) techniques have successfully addressed this challenge and achieved impressive results in processing SAR images [2].
Since the launch of the first SAR ocean remote sensing satellite, SEASAT, by the United States in 1978 [3], research in the field of sea surface ship monitoring has thrived. Over the years, numerous mathematical approaches have been used in this field, such as those based on the generalized likelihood ratio [4], polarization decomposition [5], and visual saliency [6]. While these classical algorithms have achieved good detection and recognition performance in certain marine application scenarios, they rely on establishing mathematical models and on manual feature extraction based on the operator's experience. This reliance limits the applicability of these classical algorithms to the efficient and accurate monitoring of modern ships.
The classification of ships based on SAR images is a crucial area of study for marine operations. Its goal is to effectively and precisely differentiate between the various ship types so that decision-makers have accurate information on which to act. High-performance target recognition is increasingly being achieved using artificial intelligence [7][8][9][10][11]. With higher accuracy, faster speed, and a more effective design process, deep learning (DL) is poised to become the mainstream approach.
In recent years, there have been some studies on SAR ship classification. For example, in [12], He et al. proposed a multitask learning framework to better extract deep features from medium-resolution samples, extending the use of dense convolutional networks to SAR ship classification. Sun et al., in [13], addressed the lack of ship texture information in SAR images compared to optical images by introducing a novel DL-based ship classification network that takes advantage of the significant scattering points in certain regions of the ships. This provides a promising approach for the application of SAR images in DL. Shang et al., focusing on other challenging issues, such as scale variance, large aspect ratios, intra-class diversity, and inter-class similarity, presented a novel hierarchically designed network with a spherical space [14]. However, due to objective conditions, acquiring high-quality measured SAR target sample images is costly, and their availability is very low. Additionally, SAR is sensitive to imaging parameters and target poses, which makes target classification in SAR images particularly challenging when samples are limited.
Motivated by the above discussion and aiming to deal with the aforementioned issues, this paper proposes an innovative SAR ship classification model that integrates a novel data augmentation scheme for imbalanced datasets, a latent diffusion model (LDM) [15], and an improved Tokens-to-Token Vision Transformer (T2T-ViT) [16]. In order to tackle the challenges of imbalanced training samples and data scarcity, an improved SAR ship image generation module based on the LDM is introduced. By incorporating a text-to-image generation model, new images are generated based on the input text description, addressing the issue of insufficient data samples and thereby enhancing the model's adaptability to imbalanced data. To deal with the problem of limited useful information in the usual black background of SAR ship images, we introduce a T2T-ViT classification model as our backbone network. Due to its unique Tokens-to-Token (T2T) module structure, this model can effectively utilize SAR training samples by combining local information on the basis of global modeling. Lastly, to suppress interference from irrelevant features, we employ the Squeeze-and-Excitation (SE) module to enhance the performance of the T2T-ViT backbone network [17]. This enables the network to focus more on the features that are crucial for tasks such as classification, strengthening the model's expressive power and generalization performance. Within this framework, the main contributions can be summarized as follows:

• We introduce a new SAR ship image generation module based on an LDM, which generates category-specific images by taking textual descriptions as input, thereby addressing the deficiency in data samples. This novel approach helps prevent skewed classification and overfitting during model training. The generated images effectively capture the structure and detailed features of SAR ships, providing valuable support for the training of the classification model.

• Recognizing that the Transformer model tends to neglect local information in SAR ship images and that its backbone network contains redundancy, we use T2T-ViT as the model's backbone network in order to achieve locality through the T2T module while simultaneously reducing computational complexity. It turns out that this approach effectively captures subtle variations and features in SAR ship images, thereby enhancing the overall performance.

• In order to further improve the performance of T2T-ViT, we introduce the SE module. The dynamic weight adjustment provided by the SE module enables the network to better focus on crucial features for the current task, facilitating the capture and utilization of relevant feature information. This mechanism strengthens the network's performance, making it more precise and reliable in handling SAR ship images.
The remaining sections of this paper are organized as follows. Section 2 reviews relevant work in the field of target recognition. Section 3 provides a brief introduction to the principles of the Transformer framework. In Section 4, the proposed method is presented. To evaluate the proposed method, experiments conducted on two SAR ship datasets are described in Section 5. Section 6 concludes the work presented in this paper.

Related Work
In this section, we briefly review previously published key papers that have presented ship classification techniques using (i) traditional and (ii) modern deep learning-based methodologies.

Traditional Classification Methods
SAR ship target classification involves further image processing after detecting ship targets, aiming to identify the category of the detected ships. Gouaillier et al. applied Principal Component Analysis (PCA) to the feature extraction of ship targets [18]. In particular, they established a covariance matrix for a set of ship outlines, diagonalized it, selected a subset of principal components corresponding to the highest eigenvalues in the ship's feature space, and trained it with ship side-view angles within a 60-degree range. Experimental results showed that the PCA-based ship classifier design exhibited good discriminative performance. Wang et al. proposed a peak detection algorithm based on two-dimensional Gaussian functions [19]. This method accurately estimated the peak position, peak amplitude, and peak width of targets in simulated and measured SAR images, as verified by various experimental results. Ridha et al. conducted a detailed analysis of the electromagnetic scattering process of ship targets and employed a polarization decomposition method that uses a permanently symmetric scatterer to describe the ship targets [20]. However, this method showed poor performance in identifying moving targets. Margarit et al. introduced phase information into the extraction of the scattering center features of SAR ship targets, achieving the effective recognition of ship targets in motion and against strong sea clutter backgrounds [21]. Wang et al. introduced a novel approach to identifying ship targets in SAR images using the Active Appearance Model (AAM) [22]. They showed that, by describing the shape and grayscale of the targets, the AAM can more accurately characterize SAR images. Furthermore, Wang extensively discussed the application of the AAM to SAR target recognition and validated the effectiveness of this method through ship target classification in airborne synthetic aperture radar images. Knapskog et al. achieved ship target recognition by comparing the ship target contours extracted from SAR images with the contours of constructed 3D models [23]. Additionally, Chen et al. proposed a two-stage feature selection method [24] that could describe the shape and scale of ship targets in SAR images, incorporating both scattering information and intensity information.
In summary, although traditional SAR ship recognition methods have achieved satisfactory results in many applications, they have significant drawbacks, such as time-consuming manual feature design, complex mathematical approaches, and limited transferability. These disadvantages make the traditional classification methods unsuitable for state-of-the-art intelligent and automated recognition applications, for which deep learning-based methods are more appropriate; these are discussed next.

Deep Learning-Based Classification Methods
More than 25 years ago, LeCun et al. implemented the LeNet-5 model for handwritten character recognition, surpassing all other methods known at that time [25]. This work was among the earliest to train convolutional neural networks with backpropagation. The next breakthrough occurred in 2012, when Krizhevsky et al. introduced the AlexNet model [26], which was proposed for computer-vision-related tasks by incorporating operations such as ReLU activation functions, Dropout regularization, and stacked pooling. In 2014, Simonyan et al. proposed the VGG model [27], which was similar to the AlexNet model, adopting a structure of convolutional regions followed by fully connected regions. The VGG module applies a compositional rule comprising multiple identical convolutional layers and subsequent max-pooling layers. These convolutional layers maintain a constant input size, while the pooling layers reduce it by half. In the same year, Lin et al. introduced the NIN model [28], which incorporated a nested network structure. Unlike traditional convolutional layers using linear filters and nonlinear activation functions, the NIN model combines an MLP with convolution, replacing the conventional layers with a more intricate micro neural network structure. This new layer was termed "Mlpconv". Szegedy et al. introduced GoogLeNet [29], which absorbed the NIN concept and introduced the Inception module. In 2015, He et al. proposed the deep residual network ResNet [30], which achieved the residual learning of features through skip connections, demonstrating the potential of deep networks in feature extraction. Since their inception, both ResNet and Inception methods have demonstrated strong advantages and great potential in image classification, establishing the superiority of deep structures.
Furthermore, there have been related research activities aimed at constructing smaller and more efficient models. For example, in 2017, Google proposed MobileNetV1 [31], which used depthwise separable convolutions, composed of depthwise convolutions and pointwise convolutions, to replace standard convolutions. This approach significantly reduced computational costs and parameters, creating a lightweight network suitable for mobile devices. MobileNetV1 introduced two hyperparameters to balance the computational load and accuracy. Then, Tan et al. proposed MnasNet [32], an automated mobile neural architecture search approach that employs reinforcement learning to construct mobile models. MnasNet incorporates core CNN principles, achieving an excellent trade-off between accuracy improvement and latency reduction. It measures model latency directly on mobile devices and incorporates it into the primary reward function of the search algorithm. Similarly, Wang et al. proposed HRNet [33], which can maintain high-resolution representations by connecting high-resolution and low-resolution convolutions in parallel. The approach enhances high-resolution representations via repeated multi-scale fusion across the parallel convolutions, demonstrating exceptional performance across various vision tasks. Lite-HRNet [34], introduced by Yu et al. in 2021, presented an improvement by incorporating efficient shuffle blocks into HRNet. It leverages a lightweight unit called conditional channel weighting to replace pointwise convolutions within the shuffle block, resulting in accelerated recognition speed. Nevertheless, deep learning models and hybrid methods for computer vision tasks still face significant challenges. Ongoing research continues to explore image classification with the goal of addressing these issues and raising its upper limit.

Vision Transformer (ViT)
The introduction of the Transformer model marked a major breakthrough in the field of natural language processing (NLP). In particular, the use of the self-attention mechanism enabled the model to better capture long-distance dependencies and improved its ability to understand context [35]. In 2020, Dosovitskiy et al. proposed the first Vision Transformer (ViT) model [36], consisting of three components: the token generator, the ViT encoder, and the classifier.
Figure 1 presents a structural comparison between ViT and CNN. While classical CNNs rely on stacked convolutional layers to extract deep features, ViT takes a different approach by considering global information in the image along with the spatial distribution of objects. In ViT, the input image is divided into patches or tokens. Each token's position information is linearly embedded, and a new token called the Class token is introduced to represent the entire scene. The token sequence is then passed through the ViT encoder, which employs a multi-head self-attention mechanism to capture interactions between tokens. Finally, the output Class token is processed through MLP layers for scene classification. By directly incorporating global information and leveraging self-attention, ViT aims to provide a comprehensive understanding of the image, offering an alternative perspective to that provided by traditional CNN-based methods. For an input image of size h × w × c, the image is initially divided into patches of size p × p × c. Consequently, a total of n image patches can be obtained from one image, where $n = (h \times w)/(p \times p)$.
Simultaneously, a learnable Class token is added, resulting in a total of n + 1 tokens to be processed. This Class token is used to interact with all patches, ultimately learning features for classification. Next, a flattening operation is applied to the generated image patches, transforming each p × p × c patch into a one-dimensional vector of size 1 × (p × p × c), and the n one-dimensional vectors are concatenated to form a two-dimensional array of size n × (p × p × c). Subsequently, a fully connected layer is used to reduce the dimensionality of this array, yielding a two-dimensional feature a of size n × d. Within the Transformer encoder, the attention-weighted features are concatenated to form a feature z of size n × d, and through a nonlinear transformation w, interactive features f of the same size as the input features are eventually obtained. Finally, from the interactive features obtained through the Transformer encoder, only the 1 × d feature representing the Class token is extracted for subsequent classification. An MLP then maps this feature to a vector whose length equals the number of classes.
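The patch-and-embed step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `image_to_tokens`, the random projection `W_embed`, and the sizes (a 224 × 224 × 3 image, p = 16, d = 64) are assumptions chosen only for illustration, and position embeddings are omitted.

```python
import numpy as np

def image_to_tokens(image, p, d, rng):
    """Split an h x w x c image into p x p patches and embed them as d-dim tokens."""
    h, w, c = image.shape
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    n = (h // p) * (w // p)                       # n = (h * w) / (p * p)
    # Cut the image into a grid of patches, then flatten each patch to length p*p*c.
    patches = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    flat = patches.reshape(n, p * p * c)          # n x (p*p*c)
    # Hypothetical randomly initialised linear projection (learned in a real model).
    W_embed = rng.standard_normal((p * p * c, d)) * 0.02
    tokens = flat @ W_embed                       # n x d
    class_token = np.zeros((1, d))                # a learnable parameter in a real model
    return np.concatenate([class_token, tokens], axis=0)   # (n + 1) x d

rng = np.random.default_rng(0)
tokens = image_to_tokens(rng.random((224, 224, 3)), p=16, d=64, rng=rng)
print(tokens.shape)  # (197, 64): 14 * 14 = 196 patches plus one Class token
```

With these assumed sizes, the sequence passed to the encoder has n + 1 = 197 tokens, which matches the count given by the formula for n.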
Since the Transformer was originally designed for natural language processing tasks and was not modified for computer-vision-related tasks, it faces significant operational challenges compared to the CNN. For example, image data, being more complex than text data, require substantial computational resources. Thus, unlike the CNN, the Transformer must process a large number of image patches and perform complex computations, which is computationally expensive. Additionally, the ViT model's structure has certain limitations in extracting detailed features from images. It may struggle to capture fine-grained features such as subtle textures, edges, and shapes, making ViT less suitable for tasks that require fine-grained visual analysis. Furthermore, the performance of ViT models depends heavily on the quality and diversity of the training dataset used. Therefore, we will present an approach that appropriately modifies the Transformer structure to better accommodate the characteristics of image data and to improve the performance of ViT models in tasks involving fine-grained visual analysis, among others.

Contrastive Language-Image Pre-training (CLIP)
Contrastive Language-Image Pre-training (CLIP) [37] is a transferable multimodal model trained through contrastive learning using text as a supervisory signal. Unlike other contrastive learning methods in the computer vision domain, such as MoCo [38] and SimCLR [39], CLIP's training data consist of text-image pairs. This unique training approach enables CLIP to identify the correlation between text and images. By pairing text descriptions with the corresponding images, CLIP learns how to embed representations for both text and images and measures the similarity between them by comparing their embedding vectors. Consequently, CLIP can achieve cross-modal transfer learning across various tasks and domains.
As illustrated in Figure 2, a batch of N text-image pairs is encoded by the Text Encoder and the Image Encoder into text features $T_1, \ldots, T_N$ and image features $I_1, \ldots, I_N$. The N corresponding pairs $(T_i, I_i)$ are denoted as positive samples, whereas the non-corresponding text-image pairs (e.g., $T_1$ does not correspond to $I_2$, and $T_N$ does not correspond to $I_{N-1}$) are denoted as negative samples. Thus, in total, there exist N positive samples and $N^2 - N$ negative samples, which are used as positive and negative labels to train the Text Encoder and Image Encoder.
Finally, for any $i, j \in [1, N]$, the cosine similarity between $T_i$ and $I_j$ is calculated to quantify the correspondence between the text and the image. A larger cosine similarity indicates a stronger correspondence between $T_i$ and $I_j$, and vice versa. Therefore, the encoder parameters are trained to increase the cosine similarity of the N positive samples and, at the same time, to decrease the cosine similarity of the $N^2 - N$ negative samples. As depicted in Figure 2, this objective corresponds to maximizing the diagonal entries (blue background) of the similarity matrix and minimizing the off-diagonal entries.
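The similarity matrix and a contrastive objective over it can be sketched as follows. This is an illustrative NumPy sketch rather than the training code used in this paper: it assumes the symmetric cross-entropy form commonly associated with CLIP, and the temperature value of 0.07, the helper names, and the toy embeddings are all assumptions.

```python
import numpy as np

def cosine_similarity_matrix(T, I):
    """Return the N x N matrix S with S[i, j] = cos(T_i, I_j)."""
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    I = I / np.linalg.norm(I, axis=1, keepdims=True)
    return T @ I.T

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(T, I, temperature=0.07):
    """Symmetric cross-entropy; the diagonal entries are the positive pairs."""
    logits = cosine_similarity_matrix(T, I) / temperature
    idx = np.arange(len(T))
    loss_text = -log_softmax(logits, axis=1)[idx, idx].mean()    # each text picks its image
    loss_image = -log_softmax(logits, axis=0)[idx, idx].mean()   # each image picks its text
    return 0.5 * (loss_text + loss_image)

rng = np.random.default_rng(0)
N, d = 4, 8                                            # toy batch size and embedding width
I = rng.standard_normal((N, d))                        # stand-in image embeddings
T_matched = I + 0.01 * rng.standard_normal((N, d))     # texts aligned with their images
T_random = rng.standard_normal((N, d))                 # texts unrelated to the images
S = cosine_similarity_matrix(T_matched, I)
loss_matched = contrastive_loss(T_matched, I)
loss_random = contrastive_loss(T_random, I)
print(S.shape, loss_matched < loss_random)
```

Minimizing this loss raises the diagonal of the similarity matrix relative to the off-diagonal entries, which is the behaviour described for the N positive and N² − N negative pairs.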

Methods
The overall framework of the proposed method is illustrated in Figure 3. As a backbone network, T2T-ViT achieves good results without the need for a massive pre-training dataset. In cases of insufficient data samples, we employ an image generation module. In the Conditioning Module, text information is input and encoded, and then combined with the U-Net structure in the image generation module. This integration generates SAR ship images corresponding to the textual descriptions, which serve as supplementary samples. Subsequently, the data samples are input into the T2T-ViT network for training. Incorporating an SE attention mechanism into the backbone network enhances its focus on crucial features, optimizing overall performance. After processing through the T2T module, the input images are fed into the backbone network, ultimately yielding the classification results of the target.

Image Generation Module
The image generation module is based on the implementation of the LDM, and its operational structure is illustrated in Figure 3b. Firstly, it is necessary to have a variational autoencoder model comprising an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$. We input the image into the encoder for compression, converting it from the pixel space to feature vectors within the latent space. This latent representation has a lower dimensionality, abstracting away high-frequency and imperceptible details through dimensionality reduction. Next, a diffusion operation is performed in the latent representation space. This process occurs over successive time steps, introducing Gaussian noise and then gradually reducing the level of noise. Lastly, the decoder is employed to reconstruct the latent representation back into the pixel space. Its purpose is to transform vectors from the latent space into the high-dimensional pixel space, producing high-quality images that closely match the source image.
The handling of vectors in the latent space is akin to the function of the fundamental diffusion model (DM) [40]. The detailed operation of the DM is depicted in Figure 4. This model is based on a parameterized Markov chain and operates through two distinct procedures: the diffusion process and the denoising process.
The original input data are progressively mixed with Gaussian noise as part of the diffusion process, which ends after a predetermined number of iterations, when the data have become completely random. For the original input data $z_0 \sim q(z_0)$, Gaussian noise is added at each step of a diffusion process with T steps in the following manner:

$$q(z_t \mid z_{t-1}) = \mathcal{N}\left(z_t;\ \sqrt{1-\psi_t}\, z_{t-1},\ \psi_t I\right)$$

where $\psi_t$ is the variance of the noise added in step t, which increases with each step, i.e., $0 < \psi_1 < \psi_2 < \cdots < \psi_T < 1$. As the number of steps T grows, the input image gradually loses all its original information and is transformed into random noise, labeled $z_T$. The diffusion process adds noise iteratively, with the output at each step denoted by $z_t$. This characteristic allows the diffusion process to be represented by a Markov process, which can be mathematically expressed as follows:

$$q(z_{1:T} \mid z_0) = \prod_{t=1}^{T} q(z_t \mid z_{t-1})$$

The denoising process runs in the opposite direction: noise is gradually removed from the data. If the distribution $q(z_{t-1} \mid z_t)$ can be obtained at each step of the denoising process, then the initial input image information can be recovered from pure random noise $z_T \sim \mathcal{N}(0, I)$ by removing the noise repeatedly. Therefore, the denoising process can be considered a data generation process. In this process, the Gaussian distribution of each state is parameterized using neural networks, and the states are linked to one another in a Markov chain. This operation can be mathematically expressed as follows:

$$p_\theta(z_{0:T}) = p(z_T) \prod_{t=1}^{T} p_\theta(z_{t-1} \mid z_t)$$

where $p_\theta(z_{t-1} \mid z_t) = \mathcal{N}(z_{t-1};\ \mu_\theta(z_t, t),\ \Sigma_\theta(z_t, t))$ is a parameterized Gaussian distribution, and $p(z_T) = \mathcal{N}(z_T;\ 0, I)$. The core processing section of the DM, denoted by $\epsilon_\theta(\circ, t)$, is set as a time-conditioned U-Net, which is built from 2D convolutional layers and focuses the objective on the most perceptually relevant parts. The loss function, $L_{DM}$, can be written as

$$L_{DM} = \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}\left[\left\lVert \epsilon - \epsilon_\theta(z_t, t) \right\rVert_2^2\right]$$

In contrast to traditional diffusion models, we optimize the processing of input feature vectors in the previous U-Net by introducing a mechanism called cross-attention [41], transforming it into a more flexible conditional image generator. This mechanism has shown good performance in attention-based models that learn from multiple input modalities. In order to combine different types of modalities (such as images or text descriptions) with the image generation module, an encoder corresponding to the input modality y is added, which we refer to as $\tau_\theta$. The encoder can convert various modalities of input information into a feature vector $\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$, which is then used as an input into the U-Net to combine with the latent features being denoised through cross-attention.
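How a conditioning feature can be injected through cross-attention is sketched below in NumPy. The sketch assumes the usual scaled dot-product formulation, in which queries come from the latent features and keys and values come from the encoded condition. The projection matrices `W_q`, `W_k`, `W_v` and all sizes are hypothetical, and the code is not taken from the model used in this paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent, cond, W_q, W_k, W_v):
    """latent: L x d_z flattened U-Net features; cond: M x d_tau encoded condition."""
    Q = latent @ W_q             # queries come from the latent being denoised
    K = cond @ W_k               # keys come from the encoded condition
    V = cond @ W_v               # values come from the encoded condition
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)   # L x M
    return weights @ V, weights

rng = np.random.default_rng(0)
L, d_z, M, d_tau, d = 16, 32, 5, 24, 8      # hypothetical sizes
latent = rng.standard_normal((L, d_z))      # stand-in for intermediate U-Net features
cond = rng.standard_normal((M, d_tau))      # stand-in for the encoder output
W_q = rng.standard_normal((d_z, d))
W_k = rng.standard_normal((d_tau, d))
W_v = rng.standard_normal((d_tau, d))
out, weights = cross_attention(latent, cond, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (16, 8) (16, 5)
```

Each of the L latent positions receives a weighted mixture of the M condition tokens, which is how a text description can steer the denoising of every spatial location.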
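Based on image-conditioning pairs, the conditional LDM is learned using the following expression:

$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\ y,\ \epsilon \sim \mathcal{N}(0, I),\ t}\left[\left\lVert \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \right\rVert_2^2\right]$$

where the optimization process involves jointly optimizing both $\tau_\theta$ and $\epsilon_\theta$. The structure of the Conditioning Module is illustrated in Figure 3d. This conditioning mechanism is versatile; in our study, CLIP is utilized as the text encoder for image generation.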
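The forward noising chain and the noise-prediction loss can be illustrated with a short NumPy sketch. It assumes the standard Gaussian update, in which each step scales the previous state by the square root of one minus the step variance and adds noise with that variance. The linear variance schedule, the toy latent, and the function names are assumptions, and the loss is computed as a mean rather than a sum of squared errors.

```python
import numpy as np

def forward_diffusion(z0, psi, rng):
    """Apply z_t = sqrt(1 - psi_t) * z_{t-1} + sqrt(psi_t) * eps for t = 1..T."""
    states = [z0]
    z = z0
    for psi_t in psi:
        eps = rng.standard_normal(z.shape)           # fresh Gaussian noise at every step
        z = np.sqrt(1.0 - psi_t) * z + np.sqrt(psi_t) * eps
        states.append(z)
    return states                                    # [z_0, z_1, ..., z_T]

def noise_prediction_loss(eps_true, eps_pred):
    """Mean squared error between the injected noise and the predicted noise."""
    return float(np.mean((eps_true - eps_pred) ** 2))

rng = np.random.default_rng(0)
T = 1000
psi = np.linspace(1e-4, 0.02, T)       # hypothetical increasing variance schedule
z0 = np.full((4, 8, 8), 3.0)           # toy latent whose mean is far from zero
states = forward_diffusion(z0, psi, rng)
zT = states[-1]
print(len(states), round(float(zT.mean()), 2), round(float(zT.std()), 2))
```

After T steps, the toy latent has lost nearly all trace of its starting value and is close to a standard Gaussian sample, which is the starting point of the denoising process.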

Tokens-to-Token Vision Transformers (T2T-ViT)
T2T-ViT [16] is a model for image processing that extracts image features and performs sequence modeling through two stages of processing. Its structure is illustrated in Figure 3c, and its operation is described next. Initially, the image is divided into equally sized image blocks and encoded through a series of nested Transformer encoders to generate locally informed representations of the image. Subsequently, the obtained feature vectors containing local information are sent to the backbone network, resulting in a feature representation containing the overall information of the image. In this way, the model progressively converts the image into tokens and processes them with an efficient backbone structure.
Its key component is the T2T module, which is purpose-built to capture and model the local structural information within the image. Additionally, this module gradually reduces the number of tokens as the image progresses through the network. In this way, the T2T module can represent different regions and features of the image as relatively short token sequences, achieving the effective compression and expression of image information. The T2T module, as shown in Figure 5, performs two operations, namely, reconstruction and soft splitting.
The output token sequence, $T_i$, is used as input into the T2T Transformer for processing, and $T'_i$ is obtained through the following operation:

$$T'_i = \mathrm{MLP}(\mathrm{MSA}(T_i))$$

where MSA represents layer-normalized multi-head self-attention, the function of which is to capture dependencies at different positions in the sequence through multi-head attention calculations. Furthermore, MLP is a layer-normalized multilayer perceptron, which is used to process feature representations at each location. Then, these tokens are reshaped in the spatial dimension to form the image $I_i$:

$$I_i = \mathrm{Reshape}(T'_i)$$

where Reshape rearranges tokens $T' \in \mathbb{R}^{l \times c}$ into the image $I \in \mathbb{R}^{h \times w \times c}$, where l is the length of $T'$, and h, w, c represent height, width, and the number of channels, respectively, satisfying $l = h \times w$.
After obtaining the reconstructed image $I_i$, the local structural information is modeled through soft splitting (SS) to reduce the number of tokens:

$$T_{i+1} = \mathrm{SS}(I_i)$$

The output tokens created during the current T2T process are then sent to the subsequent Transformer layer. In order to prevent information loss during token generation from reconstructed images, we adopt a strategy of segmenting SAR ship images into overlapping patches. This approach establishes prior knowledge by relating each patch to its neighboring patches, thereby promoting stronger correlations between tokens in close proximity. By concatenating the tokens within each segmented patch, local information can be effectively aggregated, which benefits subsequent processing.
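The reshape and soft-split operations can be sketched in NumPy as follows. The kernel size, stride, and padding values below are illustrative assumptions rather than the settings used in this paper, and the helper names are hypothetical. The Transformer layer between the two steps is omitted, so a random array stands in for its output.

```python
import numpy as np

def reshape_tokens(tokens, h, w):
    """Reshape: l x c tokens -> h x w x c image, with l = h * w."""
    l, c = tokens.shape
    assert l == h * w
    return tokens.reshape(h, w, c)

def soft_split(image, k, stride, pad):
    """Soft split: take overlapping k x k patches and flatten each into one token."""
    h, w, c = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))   # zero padding on the border
    h_out = (h + 2 * pad - k) // stride + 1
    w_out = (w + 2 * pad - k) // stride + 1
    tokens = [
        padded[i * stride:i * stride + k, j * stride:j * stride + k, :].reshape(-1)
        for i in range(h_out) for j in range(w_out)
    ]
    return np.stack(tokens)            # (h_out * w_out) x (k * k * c)

rng = np.random.default_rng(0)
T_prime = rng.standard_normal((64, 4))            # stand-in for 64 tokens with 4 channels
I_i = reshape_tokens(T_prime, h=8, w=8)
T_next = soft_split(I_i, k=3, stride=2, pad=1)    # stride < k, so the patches overlap
print(T_next.shape)  # (16, 36)
```

In this toy setting, 64 tokens become 16 longer tokens, each aggregating a 3 × 3 neighbourhood, which shows both the reduction in token count and the local aggregation described above.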

Squeeze-and-Excitation (SE) Module
In ViT, the multi-head attention mechanism plays a vital role in the Transformer layer. This mechanism not only generates attention layer outputs with encoded representation information but also learns relationships between positions in the input sequence to better capture its intrinsic structure and semantic information. In multi-head attention, the input sequence is first transformed into three distinct sequences of vectors: queries, keys, and values. Subsequently, the similarity between each query vector and the key vectors is calculated, resulting in a weight distribution for each query vector across all key vectors. Next, the obtained weight distribution is used to perform weighted averaging on the value vectors, resulting in the output representation for each query vector. Multi-head attention repeats this procedure multiple times, each time using different projection matrices for queries, keys, and values to generate different attention subspaces. Finally, the output is created by concatenating the results from each subspace and is passed on to subsequent operations.
The structure of the SE module is illustrated in Figure 6. In order to highlight the important features, the SE module is added after the output of the multi-head attention mechanism. The SE module consists of squeeze and excitation components, which recalibrate channel weights by modeling relationships between channels. As a result, features in the more informative channels become more prominent.
The essence of the squeeze operation is a pooling operation $F_{sq}(\cdot)$, which compresses the input feature map M by pooling, converting the spatial information contained in it into channel information $A \in \mathbb{R}^{N}$. The calculation of the l-th element of A is as follows:

$$a_l = F_{sq}(m_l) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} m_l(i, j)$$

where $m_l$ is the l-th feature vector corresponding to the input feature map, and $H \times W$ is its spatial size. After obtaining the compressed feature information, an excitation function $F_{ex}(\cdot)$ is used to extract the relationship B between channels, that is, the degree of attention to each channel, which can be computed as

$$B = F_{ex}(A, V) = \sigma\left(V_2\, \delta(V_1 A)\right)$$

where δ represents the ReLU activation function of the first fully connected layer, σ represents the sigmoid activation function of the second fully connected layer, and $V_1$ and $V_2$ represent the weight matrices of the fully connected layers. The final output of the SE module is obtained by rescaling A through the activation B and is given by

$$F_{scale}(a_l, b_l) = b_l \cdot a_l$$

where $F_{scale}(\cdot)$ represents the rescaling function, that is, the product of the obtained weight $b_l$ and the original feature map $a_l$ in the corresponding channel. During the task-learning process, the weights of channels related to the context are increased, enhancing the expressive power of the features.
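The squeeze, excitation, and rescaling steps can be sketched in NumPy as below. This follows the standard SE formulation, in which global average pooling produces one value per channel and the resulting weights rescale the feature map channel by channel. The feature-map layout (H × W × N), the reduction ratio r, and the random weight matrices are assumptions for illustration, not the configuration used in this paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(M, V1, V2):
    """M: H x W x N feature map; V1: (N/r) x N; V2: N x (N/r)."""
    A = M.mean(axis=(0, 1))                       # squeeze: global average pooling -> N
    B = sigmoid(V2 @ np.maximum(V1 @ A, 0.0))     # excitation: FC -> ReLU -> FC -> sigmoid
    return M * B, B                               # scale: channel-wise reweighting

rng = np.random.default_rng(0)
H, W, N, r = 7, 7, 16, 4          # hypothetical sizes; r is the reduction ratio
M = rng.standard_normal((H, W, N))
V1 = rng.standard_normal((N // r, N)) * 0.1
V2 = rng.standard_normal((N, N // r)) * 0.1
out, B = se_block(M, V1, V2)
print(out.shape, B.shape)  # (7, 7, 16) (16,)
```

Because the sigmoid keeps every weight strictly between 0 and 1, the block can only attenuate channels relative to one another; training decides which channels are kept close to full strength.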

Experiments and Performance Evaluation Results
In the previous section, we proposed an improved SAR ship classification method based on text-to-image generation and an SE module integrated with T2T-ViT. In this section, we evaluate the performance of the proposed classification method against other conventional target classification techniques using two publicly available SAR ship datasets. We also present the results of ablation experiments and category expansion experiments to demonstrate the superiority of the proposed method.

Datasets and Settings
The OpenSARShip2.0 dataset is sourced from the Sentinel-1 satellite [42]. All images in the dataset were obtained in Interferometric Wide (IW) mode, covering nearly all global land and coastal areas. A notable feature of this dataset is that its ship labels were generated from Automatic Identification System (AIS) information, which gives the labels higher reliability. OpenSARShip2.0 comprises approximately 35,000 SAR ship images, including vessels from 14 categories, such as cargo ships, cruise ships, passenger ships, law enforcement vessels, and fishing boats. The resolution is 20 m × 20 m, with pixel sizes of 10 m × 10 m in the azimuth and range directions.
Three major categories with relatively abundant samples, namely, Cargo, Fishing, and Tug, were initially selected from the OpenSARShip2.0 dataset for our experiments. Figure 7 presents sample images of these three classes of SAR ships. However, the original data have some drawbacks. Firstly, the image sizes are not uniform, which is cumbersome when the images are fed to deep learning networks. Secondly, there is a significant disparity in the number of samples among ship categories, with Cargo having nearly 20,000 samples, far exceeding the sample counts of the other categories. Therefore, we preprocessed the selected samples of the three ship categories by standardizing the image size to 224 × 224 pixels. To address the class imbalance in the SAR ship dataset, data augmentation was performed on the training dataset. Initially, we applied horizontal and vertical flips to enhance the diversity of the data samples. Subsequently, we rotated the data samples by 90°, 180°, and 270° to simulate various orientations of vessels in real scenarios. Lastly, we randomly translated the ship sample images, with pixel translation values ranging from −5 to 5, to introduce spatial variations. After this series of data augmentation processes, the data samples were expanded to four times their original size, as shown in Table 1, which also lists the number of training samples for the three ship categories from the OpenSARShip2.0 dataset.
The FUSAR-Ship dataset offers a comprehensive collection of high-resolution ship images. It comprises 15 primary ship categories and 98 subclasses, and it encompasses various marine targets, including objects other than ships [43]. The ship images in this dataset were obtained from China's GF-3 satellite, which carries a civilian C-band spaceborne SAR system. This system captures SAR images with a high resolution of 1.124 m × 1.728 m and full-polarization capabilities. The imaging mode is the Ultra-Fine Stripmap mode, covering various scenes, such as sea, land, coastlines, rivers, and islands. Because some ship categories in the FUSAR-Ship dataset have extremely few samples, we selected the following five ship categories to further validate the effectiveness of our model: Bulk Carrier, Cargo, Fishing, Tanker, and Other. Figure 8 displays sample images of the five classes of SAR ships in the FUSAR-Ship dataset. Similarly, to obtain a balanced dataset, a series of data augmentation processes were applied to the selected samples of the five ship categories to meet the basic training requirements of the target classification task. The number of training samples for the five ship categories from the FUSAR-Ship dataset is presented in Table 2.
In our experiments, all models were trained with the same parameter settings. For both the OpenSARShip2.0 and FUSAR-Ship datasets, the input image size was fixed at 224 × 224 pixels. The Stochastic Gradient Descent optimizer was employed [44], with a weight decay of 0.005 and a momentum of 0.9. The proposed network model was trained for a total of 2000 iterations. Due to limited GPU memory, the batch size was set to an empirical value of 8. To mitigate potential issues associated with gradient vanishing during training, a relatively low learning rate of 0.0005 was chosen. Such a learning rate gives better control over the update pace of the network, which helps our approach reach stable performance.
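The geometric augmentations described for the training data (horizontal and vertical flips, rotations by 90°, 180°, and 270°, and random translations of up to 5 pixels) can be sketched as follows. This is a minimal sketch, not the authors' pipeline: the function name `augment`, the zero filling of the border after translation, and the choice to return all six views are assumptions, and the text does not specify how the operations were combined to reach the fourfold expansion.

```python
import numpy as np

def augment(image, rng, max_shift=5):
    """Return flipped, rotated, and randomly translated copies of a 2-D image."""
    # Horizontal and vertical flips.
    views = [np.fliplr(image), np.flipud(image)]
    # Rotations by 90, 180, and 270 degrees.
    views += [np.rot90(image, k) for k in (1, 2, 3)]
    # Random translation by -max_shift..max_shift pixels along each axis.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w = image.shape
    shifted = np.zeros_like(image)   # ASSUMPTION: uncovered border pixels are zero-filled
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    top, left = max(0, dy), max(0, dx)
    shifted[top:top + src.shape[0], left:left + src.shape[1]] = src
    views.append(shifted)
    return views

# Hypothetical 224 x 224 input standing in for a preprocessed SAR ship chip.
rng = np.random.default_rng(0)
image = np.arange(224 * 224, dtype=np.float64).reshape(224, 224)
views = augment(image, rng)
```

Because the images are square, every view keeps the 224 × 224 input size expected by the network.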

Performance Evaluation Indices
The classification results can be categorized into four types: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP indicates instances where the model correctly predicts positive samples as positive, TN denotes cases where the model correctly predicts negative samples as negative, FP represents instances where the model erroneously predicts negative samples as positive, and FN signifies cases where the model erroneously predicts positive samples as negative. Similar to previous studies [45][46][47][48], Accuracy is selected as the main evaluation metric for the classification performance of the proposed method. Accuracy is the percentage of correctly predicted samples out of the total and is computed as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Furthermore, three additional performance metrics were used in the experiments for further validation: Precision, Recall, and F1 Score. Precision measures the percentage of true positive samples among the samples predicted by the model as positive. It reflects the correctness of the positive predictions and is calculated as

$$\text{Precision} = \frac{TP}{TP + FP}.$$

Recall, also known as the True Positive Rate or Sensitivity, represents the percentage of true positive samples among the actual positive samples. It measures how completely the positive instances are captured and is computed as

$$\text{Recall} = \frac{TP}{TP + FN}.$$

F1 Score is a comprehensive metric that takes into account both Precision and Recall, providing a balanced measure of the model's performance. It is computed as

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$

To evaluate the generated images, we employed the Structure Similarity Index Measure (SSIM) [49], a metric for assessing the similarity between two images that takes into account luminance, contrast, and structure.
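The four classification metrics defined above follow directly from the TP, TN, FP, and FN counts. The short sketch below illustrates the arithmetic; the function name `classification_metrics`, the zero-division guards, and the example counts are hypothetical and are not taken from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Accuracy, Precision, Recall, and F1 Score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (an added convention, returning 0.0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1

# Made-up counts for one class treated as "positive".
acc, prec, rec, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For a multi-class problem such as the three-class and five-class experiments, these per-class values are typically averaged over the classes; the averaging scheme is not stated in the text.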
To assess the similarity between the mean brightness of two images, we define a luminance contrast function as follows:

$$l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1},$$

where $\mu_x$ and $\mu_y$ represent the means of the local blocks of images $x$ and $y$, and $C_1$ is a small constant used to stabilize the divisor, usually taken as $(k_1 \cdot MAX)^2$, where $MAX$ is the largest possible pixel value and $k_1$ is a small constant.
Considering the brightness variance of the two images, the contrast function is defined by

$$c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2},$$

where $\sigma_x$ and $\sigma_y$ represent the standard deviations of the local blocks of images $x$ and $y$, and $C_2$ is a small constant used to stabilize the divisor, usually taken as $(k_2 \cdot MAX)^2$, where $k_2$ is a small constant. Structural similarity is measured from the covariance between the two images, and we define the structural contrast function as follows:

$$s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3},$$

where $\sigma_{xy}$ is the covariance between $x$ and $y$, and $C_3$ is a small constant used to stabilize the divisor, usually taken as $C_2 / 2$ in the standard SSIM definition [49]. Finally, the total SSIM is obtained by combining brightness similarity, contrast similarity, and structural similarity, and it is computed as follows:

$$\text{SSIM}(x, y) = [l(x, y)]^{\alpha} \cdot [c(x, y)]^{\beta} \cdot [s(x, y)]^{\gamma},$$

where $\alpha$, $\beta$, and $\gamma$ adjust the relative importance of the three components and usually take a value of 1.
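The luminance, contrast, and structure terms can be combined in a compact sketch. For brevity, the statistics are computed over the whole image as a single block rather than over sliding local windows, so this is a simplification and not the procedure used to obtain the values reported in this paper. The function name `ssim_global` is hypothetical, and the defaults `k1 = 0.01`, `k2 = 0.03`, and $C_3 = C_2/2$ are the conventional choices from the SSIM literature, assumed here.

```python
import numpy as np

def ssim_global(x, y, max_val=255.0, k1=0.01, k2=0.03):
    """SSIM with alpha = beta = gamma = 1, using one block covering the whole image."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * max_val) ** 2
    c2 = (k2 * max_val) ** 2
    c3 = c2 / 2.0                                  # ASSUMED conventional choice
    mu_x, mu_y = x.mean(), y.mean()
    sigma_x, sigma_y = x.std(), y.std()
    sigma_xy = ((x - mu_x) * (y - mu_y)).mean()    # covariance between x and y
    luminance = (2 * mu_x * mu_y + c1) / (mu_x**2 + mu_y**2 + c1)
    contrast = (2 * sigma_x * sigma_y + c2) / (sigma_x**2 + sigma_y**2 + c2)
    structure = (sigma_xy + c3) / (sigma_x * sigma_y + c3)
    return luminance * contrast * structure

# Made-up 8-bit image and its intensity-inverted copy.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
same = ssim_global(img, img)
inverted = ssim_global(img, 255.0 - img)
```

An image compared with itself scores 1, while the inverted copy scores well below 1 because its covariance with the original is negative.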

Image Generation Experiment
After incorporating the image generation module into the model, we expanded the imbalanced dataset of the three selected categories from OpenSARShip2.0. Taking the existing samples of the three ship categories as input, paired with the text prompts SAR-Cargo, SAR-Fishing, and SAR-Tug, the model underwent iterative training. Given textual information as input, the model then generated the corresponding SAR ship images. To verify the credibility of the generated images, the SSIMs between the generated images and the original dataset were calculated, and the average result reached 0.837, which indicates that the features of the SAR ships were captured successfully. Due to the limitation of the training samples, the SAR ship images are generated from a top-down perspective. Typical samples of the ship categories generated by the image generation module are shown in Figure 9.
These results suggest that the images generated by the image generation module meet the standards for use as a dataset, exhibiting high resolution and clarity, and can therefore be used to train models or to conduct other relevant research. With these generated images, the features of the target objects can be captured and classified more accurately, improving the precision and efficiency of model training, which is important for further optimizing ship classification algorithms. For the three selected ship categories in the OpenSARShip2.0 dataset, we augmented the training set to 700 images using the generated image samples. The subsequent ship classification experiments therefore rely on a more extensive dataset, which allows the ship classification model to be evaluated and fine-tuned more comprehensively.

Performance Comparison Results
To evaluate the effectiveness of the proposed method more comprehensively, we compared it with three traditional machine learning models and nine deep learning classification models on the OpenSARShip2.0 and FUSAR-Ship datasets. For the proposed method, we chose 14 Transformer layers, as this setting offered the best balance between model accuracy and the number of model parameters. The traditional machine learning models are SVM [50], AdaBoost [51], and KNN [52]. The deep learning models are LeNet [25], AlexNet [26], ResNet [30], MobileNet [31], DeiT [53], DenseNet [54], EfficientNet [55], ShuffleNet [56], and CSPNet [57]. The comparison results are shown in Table 3.
From Table 3, it can be observed that the proposed method achieved higher classification performance than the other classification algorithms on the three categories of the OpenSARShip2.0 dataset. It achieved an Accuracy of 74.46%, whereas the second-best algorithm, ShuffleNet, achieved 73.41%, a gain of 1.05 percentage points. There were also improvements in the other evaluation metrics. The five-class experiments on the FUSAR-Ship dataset are more difficult because of the larger number of categories. Nevertheless, the proposed method still achieved the best classification performance, with an Accuracy of 72.19%, which is 0.83 percentage points higher than that of the second-best algorithm, ShuffleNet. The other three evaluation metrics also improved, by 0.76, 0.91, and 0.63 percentage points. The proposed method requires 53 s per epoch during training, which is mid-range among the compared networks. At inference, classifying eight images takes 51.37 ms, which is not significantly different from the inference time of most other models. In summary, in the face of imbalanced training samples, the proposed method demonstrated stronger capabilities by supplementing training samples through the image generation module and adjusting feature weights using the SE attention mechanism. Therefore, it is well suited to situations in which data are scarce.

Performance Results
To verify the effectiveness of the image generation and SE modules in improving performance, we conducted ablation experiments on three categories of the OpenSARShip2.0 dataset. Specifically, we compared the results of experiments with and without these two modules and calculated their respective classification accuracies. The results of the ablation experiments are shown in Table 4, where "✓" indicates that the corresponding module was added to the base model T2T-ViT.
From the performance results presented in Table 4, it can be observed that T2T-ViT, as a variant of ViT, outperforms ViT by making better use of the information in the image through the T2T module, even without extensive pre-training on large datasets. This is consistent with the imbalance and lack of training data images faced in the classification [...] ratio and incorporated them into the improved T2T-ViT for training. The classification performance results for the four ship categories are presented in Table 6, where the basic model was trained on the dataset with unbalanced samples. Due to the inter-class similarity of SAR ship data, when Tanker is trained with only a small number of samples, the basic model is unable to extract its features effectively, and tankers are misclassified as ships of other classes. The introduction of the image generation module not only maintains the classification performance on the three basic ship categories but also improves the recognition rate of the fourth category, Tanker. In terms of Accuracy, the most important evaluation metric, the basic model reaches 59.03% because of the relatively small proportion of Tanker samples in the original data, whereas the improved model reaches 60.26%, a gain of 1.23 percentage points. Compared with training on the original dataset alone, the image generation module supplies additional training samples and richer sample information, which improves the network's generalization ability and classification accuracy and enables a more comprehensive SAR ship classification. This result supports both the effectiveness of the image generation module in expanding the dataset and the feasibility of the proposed method. By incorporating the image generation module, our approach adapts better to various ship categories and achieves satisfactory classification results.

Conclusions
In this paper, a novel SAR ship classification method is proposed to address the issue of inter-class sample imbalance in SAR ship datasets. The approach uses text-to-image generation to mitigate the imbalance in the dataset, compensating for insufficient data samples by coupling an image generation module with T2T-ViT. In addition, the SE module is introduced to enhance the network's focus on key features, thereby improving classification accuracy. Classification experiments were conducted on the OpenSARShip2.0 and FUSAR-Ship datasets, demonstrating that the proposed method outperforms the compared algorithms in Precision, Recall, F1 Score, and Accuracy. Additionally, we conducted ablation experiments and extended ship classification experiments with four categories, further demonstrating the effectiveness and stability of the proposed method.

Figure 1. A structural comparison diagram between CNN and ViT.
For input features, position encoding is added to indicate the relative position of each image block. Subsequently, the preprocessed features are fed into the Transformer encoder to obtain interactive features. The most crucial component here is the multi-head attention layer. With $K$ heads, a given input feature $a$ of size $n \times d$ is split into $K$ different features, i.e., $[a_1, a_2, \ldots, a_K]$. Self-attention is then computed on each of these $K$ features, yielding the corresponding weighted features $[b_1, b_2, \ldots, b_K]$.
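The head-splitting step described above can be illustrated with a simplified NumPy sketch. To keep it short, the learned query, key, and value projections of a real Transformer layer are omitted, so each head attends directly over its own slice of the input; the function names and sizes are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(a, num_heads):
    """Split a (n, d) feature into K heads and apply self-attention per head."""
    n, d = a.shape
    assert d % num_heads == 0, "d must be divisible by the number of heads"
    d_k = d // num_heads
    # [a_1, ..., a_K]: each head receives a (n, d_k) slice of the feature.
    heads = a.reshape(n, num_heads, d_k).transpose(1, 0, 2)      # (K, n, d_k)
    # SIMPLIFICATION: queries, keys, and values are the head itself (no learned projections).
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d_k)     # (K, n, n)
    weights = softmax(scores, axis=-1)
    b = weights @ heads                                          # [b_1, ..., b_K], (K, n, d_k)
    # Concatenate the weighted features back into a (n, d) output.
    return b.transpose(1, 0, 2).reshape(n, d), weights

# Hypothetical sizes: 5 tokens, feature dimension 12, 3 heads.
rng = np.random.default_rng(0)
a = rng.normal(size=(5, 12))
b, weights = multi_head_self_attention(a, num_heads=3)
```

Each row of `weights` sums to 1, so every output token is a convex combination of the tokens within the same head.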
CLIP employs a Text Encoder and an Image Encoder. The former extracts features from text and can use text Transformer models commonly available in NLP. The latter extracts features from images and can use popular CNN models or ViT. The training process of CLIP on a text-image paired dataset can be described as follows. Firstly, if a batch in the dataset contains $N$ text-image pairs, the $N$ texts are encoded by the Text Encoder, with each text encoded into a one-dimensional vector. The output of the Text Encoder for this batch of text data is denoted by $[T_1, T_2, \cdots, T_N]$. Similarly, the $N$ images are encoded by the Image Encoder, with each image encoded into a one-dimensional vector. The output of the Image Encoder for this batch of image data is denoted by $[I_1, I_2, \cdots, I_N]$.
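A minimal sketch of how such a batch of text and image vectors can be compared, assuming the contrastive objective of the publicly described CLIP model. Random vectors stand in for the outputs of the Text Encoder and Image Encoder, and the function names, the embedding size, and the temperature value of 0.07 are assumptions made for illustration.

```python
import numpy as np

def clip_similarity(text_feats, image_feats):
    """Cosine similarity between every image vector I_p and every text vector T_q."""
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    i = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    return i @ t.T                                   # shape (N, N)

def symmetric_contrastive_loss(sim, temperature=0.07):
    """Cross-entropy over rows and columns, with matching pairs on the diagonal."""
    logits = sim / temperature
    idx = np.arange(sim.shape[0])

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)         # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[idx, idx].mean()

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Stand-ins for [T_1, ..., T_N] and [I_1, ..., I_N] with N = 4 and 16-dimensional vectors.
rng = np.random.default_rng(0)
T = rng.normal(size=(4, 16))
I = T + 0.1 * rng.normal(size=(4, 16))               # images loosely aligned with their texts
sim = clip_similarity(T, I)
loss = symmetric_contrastive_loss(sim)
```

Training pushes the diagonal entries of `sim`, which belong to the matching pairs, towards 1 and the off-diagonal entries down.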

Figure 2. The pre-training process of CLIP.
Secondly, in the obtained $[T_1, T_2, \cdots, T_N]$ and $[I_1, I_2, \cdots, I_N]$, the text-image pairs have a one-to-one correspondence, i.e., $T_1$ corresponds to $I_1$, $T_2$ corresponds to $I_2$, etc. These N

Figure 3. The overall framework of the proposed method.

Figure 4. The operation of the diffusion model.

Figure 5. The structure of the T2T module.

Figure 6. The structure of the SE module.

Figure 7. SAR ship samples from the OpenSARShip2.0 dataset representing (a) an optical image of Cargo; (b) a SAR image of Cargo; (c) an optical image of Fishing; (d) a SAR image of Fishing; (e) an optical image of Tug; and (f) a SAR image of Tug.

Figure 8. SAR ship samples in the FUSAR-Ship dataset. Among these, (a-e) represent optical images of Bulk Carrier, Cargo, Fishing, Tanker, and Other, and (f-j) represent SAR images of Bulk Carrier, Cargo, Fishing, Tanker, and Other.

Figure 9. Image results generated by inputting text information: SAR Cargo, SAR Fishing, SAR Tug; (a-c) represent Cargo images, while (d-f) represent Fishing images, and (g-i) represent Tug images.

Table 1. Three ship categories' sample statistics from the OpenSARShip2.0 dataset.

Table 2. Five ship categories' sample statistics from the FUSAR-Ship dataset.

Table 3. A comparison of quantitative evaluation indicators on two datasets. For a clear display, the highest score in each column is highlighted in bold. Train time represents the time it takes for the model to train one epoch, and test time represents the time it takes for the model to classify 8 images at once.

Table 6. Classification results after adding a fourth category of ships from the OpenSARShip2.0 dataset.