Article

A Combined CNN-LSTM Network for Ship Classification on SAR Images

ENSTA Bretagne, Lab-STICC, UMR CNRS 6285, 29806 Brest, France
*
Author to whom correspondence should be addressed.
Sensors 2024, 24(24), 7954; https://doi.org/10.3390/s24247954
Submission received: 19 October 2024 / Revised: 29 November 2024 / Accepted: 6 December 2024 / Published: 12 December 2024

Abstract

Satellite SAR (synthetic aperture radar) imagery offers global coverage and all-weather recording capabilities, making it valuable for applications like remote sensing and maritime surveillance. However, its use in machine learning-based automatic target classification faces challenges, including the limited availability of SAR target training samples and the inherent constraints of SAR images, which provide less detailed features compared to natural images. These issues hinder the effective training of convolutional neural networks (CNNs) and complicate the transfer learning process due to the distinct imaging mechanisms of SAR and natural images. To address these challenges, we propose a shallow CNN architecture specifically designed to optimize performance on SAR datasets. Evaluations were performed on three datasets: FUSAR-Ship, OpenSARShip, and MSTAR. While the FUSAR-Ship and OpenSARShip datasets present difficulties due to their limited and imbalanced class distributions, MSTAR serves as a benchmark with balanced classes. To compare and optimize the proposed shallow architecture, we examine various properties of CNN components, such as the filter numbers and sizes in the convolution layers, to reduce redundancy, improve discrimination capability, and decrease network size and learning time. In the second phase of this paper, we combine the CNN with Long short-term memory (LSTM) networks to enhance SAR image classification. Comparative experiments with six state-of-the-art CNN architectures (VGG16, ResNet50, Xception, DenseNet121, EfficientNetB0, and MobileNetV2) demonstrate the superiority of the proposed approach, achieving competitive accuracy while significantly reducing training times and network complexity. This study underscores the potential of customized architectures to address SAR-specific challenges and enhance the efficiency of target classification.

1. Introduction

Synthetic aperture radar (SAR) technology has revolutionized maritime surveillance and object and vessel classification by providing high-resolution, all-weather imaging capabilities. SAR sensors have become instrumental in monitoring vast maritime areas and in ship detection, identification, and classification. In recent years, significant advancements have been made in the field of ship and object classification using SAR imagery, enhancing maritime security, search-and-rescue operations, marine transportation management, marine security situational awareness, environmental monitoring, and so on.
With the rapid development of satellite imagery, the need to analyze these images grows more every day. A promising solution is the use of artificial intelligence to extract and classify the different targets found in satellite images. In the case of vehicles, artificial intelligence (AI) can help recognize models and determine specific features such as the speed of movement or the dimensions of the targets. Multiple techniques and algorithms have been developed to find better solutions, leading to the use of deep neural networks. By adapting deep neural networks specialized in object detection and identification, we can now recognize specific ships and vehicles using satellite imagery.
Recently, with advances in computational power and the ability to parallelize calculations using GPUs, deep learning (DL) architectures, as instances of machine learning (ML) algorithms, have been investigated further. Unlike traditional machine learning methods, deep neural networks (DNNs) mimic the functioning of the human brain and are parametric, meaning the number of their parameters is independent of the size of the training dataset. This feature is crucial during the prediction phase, as it reduces processing time and enhances the applicability of these algorithms for real-time image processing. Additional interest in using DL schemes arises from their rapid implementation on FPGAs [1] and ASICs [2].
DL has made great progress in a variety of real-world problems, e.g., detection, recognition, identification, motion tracking, action recognition, prediction, and data denoising or dehazing. In this context, we can mention CNNs, the “Boltzmann family” including deep belief networks, deep Boltzmann machines, and stacked auto-encoders (for denoising). Furthermore, we can also refer to recurrent neural networks (RNNs), which are more adapted for signals processed over variable observation windows [3]. In RNNs, long short-term memory (LSTM) networks [4] are usually used to explore temporal aspects and correlation in sequential and multi-view data [5].
Nevertheless, DL still has some limitations in generalizing performances and optimization architecture components in real-world applications that need to be studied in depth by researchers. One challenge is the insufficient number of SAR ship training samples, which hinders the effective training of CNNs. Additionally, the limited information available in SAR images, compared to natural images like those in ImageNet, restricts the extraction of discriminative feature descriptors. To overcome the problem of insufficient and unbalanced data, one of the most suitable strategies is to pre-train models from large datasets and adapt these models by transfer learning and fine-tuning on the target dataset (i.e., a SAR images dataset) [6,7]. The second strategy deals with online and offline data augmentation [8,9,10] and investigation on semi-supervised and unsupervised learning methods to deal with limited labeled SAR data [11]. Techniques like self-training and clustering-based approaches aim to enhance classification performance with minimal labeled samples.
On the other hand, to extract the most discriminative representation features from SAR images for target objects (such as ships and vehicles), extensive research has been conducted on various aspects, including network architecture design and optimization [12], embedding attention mechanisms [13,14,15], feature and/or decision fusion [16,17], learning strategies [18], one-shot learning [16,19], and more. Hence, efforts have been made to improve the performance of SAR object classification. However, these usually require more complex network structures, higher-dimensional features, and higher storage costs.
Furthermore, a limited SAR dataset proves insufficient for adequately learning the numerous parameters of a complex CNN. This inadequacy leads to overfitting, and the ship features extracted by the CNN exhibit significant redundancy, directly compromising the model's discriminative capability [20]. As reviewed in [21], many studies have explored CNN architectures for ship classification, leveraging their ability to automatically learn features from SAR images. These architectures are often fine-tuned or adapted to suit the unique characteristics of SAR data, achieving improved classification accuracy.
The authors in [22] provide a comprehensive overview of recent advances in ship detection and classification using deep learning models, focusing on SAR imagery. They review various architectures, including CNNs and detection models, and discuss challenges and future directions in the field. In ship classification, the complex nature of SAR data and the variability of ship signatures pose significant challenges for traditional classification methods. Existing approaches often struggle to effectively capture both the spatial and temporal characteristics of SAR images without including the sequential and temporal information.
To address these limitations, this paper proposes a novel deep learning architecture that combines shallow convolutional neural networks (CNNs) and long short-term memory (LSTM) for ship classification on limited datasets. By leveraging the strengths of both CNNs and LSTMs, our model effectively captures both local spatial features and temporal dependencies within limited SAR images, leading to significant improvements in classification accuracy compared to state-of-the-art methods.
In this context, several works in the literature combine LSTM networks with CNN frameworks to enhance target detection and recognition in remote sensing data. The authors in [23] introduce an architecture known as Multi-Stream CNN (MS-CNN) for automatic target recognition (ATR) in synthetic aperture radar (SAR) that utilizes SAR images from various perspectives. They implement a multi-input architecture that combines information from different views of the same target taken from diverse angles, allowing the multi-view design of MS-CNN to maximize the utility of limited SAR image data and boost recognition performance. The authors also provide a comprehensive overview of the literature on LSTMs applied to multi-view ATR methods. To reduce the influence of azimuth variation on SAR ATR and extract azimuth-robust features from SAR series, the authors in [24] propose a Conv-BiLSTM Prototypical Network (CBLPN), which uses a convolutional bidirectional long short-term memory (Conv-BiLSTM) as the feature extractor, together with a Euclidean-distance-based classifier, both adapted to few training samples.
Target detection in maritime radar data often struggles with issues such as clutter and low signal-to-noise ratios. To overcome these limitations, the work in [25] proposes a novel CNN-LSTM architecture specifically designed for augmenting target detection in real maritime wide-area surveillance radar data.
Some recent works explore the fusion of SAR imagery with other modalities, such as optical imagery or AIS (automatic identification system) data, using deep learning techniques. These multi-modal fusion methods [10] enhance classification performance by leveraging complementary information from different sources. New techniques are also being developed to better differentiate between features or to improve network performance, including attention mechanisms such as the RASNet architecture [13] and Transformer-based architectures [26], which have gained traction for ship classification from SAR imagery. These mechanisms allow the model to focus on relevant regions in the SAR image, improving its ability to capture intricate ship features and aiding accurate classification. In the same context, to address the variability of spatiotemporal resolutions in SAR images, the RSMamba method [27] proposes an innovative architecture for remote sensing image classification. This approach leverages the State Space Model (SSM) framework alongside the hardware-efficient Mamba design [28], effectively combining a global receptive field with linear modeling complexity to deliver both efficiency and accuracy in classification tasks.
Researchers are also investigating semi-supervised and unsupervised learning methods [29] to deal with limited labeled SAR data. Techniques like self-training and clustering-based approaches aim to enhance classification performance with minimal labeled samples.
In the literature, the hybridization of CNNs and LSTM has been proposed across various application domains. For trajectory prediction, CNN-LSTM hybrid models [30] have been widely applied to aircraft 4D trajectory modeling and human-driven vehicle path forecasting, demonstrating their efficiency in handling spatial–temporal complexities. In machine vision and fuel consumption, the authors in [31] propose CNN-LSTM frameworks adapted for tasks such as feature extraction and prediction, improving accuracy and robustness. A comprehensive overview for advancements in deep learning techniques and the combination of CNN-LSTM in maritime applications can be found in [32]. In the medical field, particularly for COVID-19 detection and analysis, the study [5] introduces a hybrid CNN-LSTM approach to classify COVID-19 cases using sequential and temporal chest X-ray (CXR) images.
Our main contributions can be summarized as follows:
  • Specific focus on single-view SAR imagery: Unlike other studies, our research targets the challenges of synthetic aperture radar (SAR) imagery, such as limited labeled datasets, imbalanced class distributions, and non-sequential images.
  • Proposed optimizations: We propose a shallow CNN combined with LSTM to reduce network complexity, minimize training time, and improve classification accuracy for SAR datasets. This contrasts with standard CNN-LSTM implementations, which often prioritize depth and complexity. Through a systematic evaluation of CNN components, such as the number and size of filters, we aim to optimize the model’s performance while minimizing computational cost and training time.
  • Comprehensive validation: We validated our architecture on three distinct SAR datasets (FUSAR-Ship, OpenSARShip, and MSTAR), showcasing its adaptability and competitive performance in handling datasets of varying size, balance, and difficulty.
As part of this paper, we present a brief concept, the dataset, and the theoretical description of the proposed architectures. Then, we summarize the implementation of the different algorithms and the choices and solutions taken to circumvent the obstacles encountered. Finally, we evaluate and compare the proposed architectures with the results of classical convolutional neural networks.

2. Convolutional Neural Networks (CNNs)

2.1. Description of CNN

The principle of a CNN consists of automatically extracting the relevant features and then carrying out the classification or identification phase. In this paper, we are only interested in the classification task.
CNN architectures can be broadly classified as shallow or deep, each suited to different tasks and datasets.
In this paper, we are inspired in the first step by a shallow architecture introduced in [16,33]. This relatively simple CNN architecture is composed of two convolution layers, two max-pooling layers, and three FC layers.
There are also many complex architectures that are widely used in the field of optical images due to the abundance of annotated data in this domain. This allows DNNs applied to this domain to have very good performances. We can cite, for example, the following very deep architectures: VGG [34], ResNet [35], Xception [36], DenseNet [37], MobileNetV2 [38], EfficientNet [39], and Siamese and RasNet Neural Networks [13,16].

2.1.1. Description of CNN Architecture Adopted

In this section, the proposed CNN architecture is highlighted. The design aims to balance computational efficiency and overfitting prevention. The proposed architecture includes three convolutional (CONV) layers and three fully connected (FC) layers ($n_{fc} = 3$). The input data comprise a tensor $\mathbf{R} \in \mathbb{R}^{N_R \times N_s \times N_s}$, where $N_R = 1$ and $N_s = 128$. Each convolutional layer of the proposed CNN comprises four steps to extract features before classification (a minimal code sketch of these steps is given after the list below). These steps comprise the following:
  • Zero-padding step: Ensures no information is lost at the borders during convolution. If $(Z_1, Z_2) \in \mathbb{N}^2$ denotes the numbers of zeros added to the last two tensor dimensions, the zero-padding step constructs the tensor $\mathbf{R}^{pad} = (r^{pad}_{t, s_1, s_2}) \in \mathbb{R}^{N_R \times (N_s + Z_1) \times (N_s + Z_2)}$.
  • Convolutional step: Extracts features by applying filters to the input tensor. Each filter moves across the tensor with defined strides, producing an output that highlights key spatial patterns. The process is parameterized by the number, size, and strides of the filters, optimizing feature extraction for the classification task.
    If $K$ denotes the number of filters, $\mathbf{W}^k = (w^k_{t, u_1, u_2}) \in \mathbb{R}^{N_R \times U_1 \times U_2}$ represents the $k$-th filter, where $(U_1, U_2) \in \mathbb{N}^{*2}$ is the size of all filters and $(a_1, a_2) \in \mathbb{N}^{*2}$ denotes the strides of the filters along the last two dimensions, then the output of the convolutional step is mathematically given by
    $$r^{conv}_{k, s_1^c, s_2^c} = \sum_{t=1}^{N_R} \sum_{u_1=1}^{U_1} \sum_{u_2=1}^{U_2} w^k_{t, u_1, u_2} \, r^{pad}_{t,\, a_1 \beta^c + u_1,\, a_2 \omega^c + u_2};$$
    where
    $$\beta^c = s_1^c - 1, \quad \omega^c = s_2^c - 1, \quad s_1^c \in [1, \ldots, N_{conv,1}], \quad s_2^c \in [1, \ldots, N_{conv,2}], \quad k \in [1, \ldots, K],$$
    $$N_{conv,1} = \left\lfloor \frac{N_s + Z_1 - U_1}{a_1} \right\rfloor + 1, \quad N_{conv,2} = \left\lfloor \frac{N_s + Z_2 - U_2}{a_2} \right\rfloor + 1.$$
    Then, the resulting tensor is given by
    $$\mathbf{R}^{conv} = (r^{conv}_{k, s_1^c, s_2^c}) \in \mathbb{R}^{K \times N_{conv,1} \times N_{conv,2}}.$$
    An activation function is then applied to this tensor. The Rectified Linear Unit (ReLU) activation function is used.
  • Max-pooling step: Downsamples the feature maps by retaining the highest value within a defined window, reducing dimensionality while preserving the most significant features. This process enhances computational efficiency and focuses on dominant spatial patterns.
    If $(V_1, V_2) \in \mathbb{N}^{*2}$ is the size of the max-pooling window and $(b_1, b_2) \in \mathbb{N}^{*2}$ are its strides along the last two dimensions, respectively, the output of max-pooling applied on the activated CONV tensor
    $$\mathbf{R}^{conv.act} = (r^{conv.act}_{k, s_1^c, s_2^c}) \in \mathbb{R}^{K \times N_{conv,1} \times N_{conv,2}}$$
    can be expressed by
    $$r^{output}_{k, s_1^{mp}, s_2^{mp}} = \max_{v_1 = 1, \ldots, V_1} \; \max_{v_2 = 1, \ldots, V_2} \left( r^{conv.act}_{k,\, b_1 \beta^{mp} + v_1,\, b_2 \omega^{mp} + v_2} \right);$$
    where
    $$k \in [1, \ldots, K], \quad \beta^{mp} = s_1^{mp} - 1, \quad \omega^{mp} = s_2^{mp} - 1, \quad s_1^{mp} \in [1, \ldots, N_{maxp,1}], \quad s_2^{mp} \in [1, \ldots, N_{maxp,2}],$$
    $$N_{maxp,1} = \frac{N_{conv,1} - V_1}{b_1} + 1, \quad N_{maxp,2} = \frac{N_{conv,2} - V_2}{b_2} + 1.$$
  • Dropout step: Mitigates overfitting by randomly setting a fraction of the tensor’s elements to zero during training. This regularization technique reduces the network’s reliance on specific neurons, improving its generalization to unseen data.
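To make these four steps concrete, the following is a minimal PyTorch sketch of one CONV block (zero-padding plus convolution, ReLU activation, max-pooling, dropout), together with a helper reproducing the output-size formula above. The hyperparameter values are illustrative placeholders, not the tuned settings reported later in Table 4.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One CONV block of the proposed CNN: zero-padding + convolution,
    ReLU activation, max-pooling, and dropout. The hyperparameter values
    below are illustrative placeholders, not the tuned settings."""
    def __init__(self, in_channels, out_channels, kernel_size=4,
                 stride=1, padding=1, pool_size=2, p_drop=0.25):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=padding)  # zero-padding + convolution
        self.act = nn.ReLU()                                   # ReLU activation
        self.pool = nn.MaxPool2d(pool_size)                    # max-pooling step
        self.drop = nn.Dropout2d(p_drop)                       # dropout step

    def forward(self, x):
        return self.drop(self.pool(self.act(self.conv(x))))

def conv_output_size(n_s, z, u, a):
    """Spatial size after convolution: floor((N_s + Z - U) / a) + 1, where Z is
    the total number of zeros added along the dimension (Z = 2 * padding when
    padding is applied symmetrically)."""
    return (n_s + z - u) // a + 1

x = torch.randn(32, 1, 128, 128)          # a batch of single-channel 128 x 128 SAR chips
print(ConvBlock(1, 64)(x).shape)          # torch.Size([32, 64, 63, 63])
print(conv_output_size(128, 2, 4, 1))     # 127 (spatial size before max-pooling)
```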
The structure of the dense layers and feature extraction influence the classification performance, and the choice between deep or shallow architectures depends on the characteristics of the dataset and the application tasks.
The datasets used here are shallow, i.e., each has a low number of samples per class relative to the dataset size typically required to train deep learning-based methods. In fact, the FUSAR-Ship, MSTAR, and OpenSARShip datasets only have 580, 275, and 224 training samples per class on average, respectively.
Basha et al. [40] reported that shallow models perform better than deeper CNNs on shallower datasets. On the basis of this observation, the number of FC layers is fixed to three.
Another observation reported in [40] is that deeper architectures require fewer neurons in FC layers in order to achieve better performance, regardless of the size and type of the dataset. Therefore, to reduce the output, the FC layers decrease in terms of the number of neurons used; that is, if $N_i$ and $N_{out}$ are, respectively, the numbers of neurons in the $i$-th and the last FC layers, where $i \in [1, 2, 3]$, then $N_1 > N_2 > N_3 > N_{out}$. A parameter $N \in \mathbb{N}^*$ is defined to parameterize the number of neurons in the first three FC layers, which are determined as $N_1 = N$, $N_2 = \frac{3N}{4}$, and $N_3 = \frac{N}{2}$. $N_{out} = C$ is the number of targeted classes.
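As an illustration of this sizing rule, the helper below builds the FC head from the single parameter N, with 50% dropout after the first two FC layers as described in the next paragraph; this is our reading of the text, not the authors' exact implementation.

```python
import torch.nn as nn

def fc_head(n_features, N, n_classes, p_drop=0.5):
    """FC head with decreasing widths N_1 = N, N_2 = 3N/4, N_3 = N/2, followed
    by an output layer with N_out = n_classes neurons. Dropout (50%) follows
    the first two FC layers only; this layout is an interpretation of the text."""
    n1, n2, n3 = N, (3 * N) // 4, N // 2
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(n_features, n1), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(n1, n2), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(n2, n3), nn.ReLU(),
        nn.Linear(n3, n_classes),  # logits; softmax is applied inside the CE loss
    )

# example: last CONV output of shape (K_3, 2, 2) with K_3 = 128, N = 256, C = 4 classes
head = fc_head(n_features=2 * 2 * 128, N=256, n_classes=4)
```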
The output of the FC layers depends on the weights between the neurons and the activation function. Let $L_i$ and $L_j$ represent two fully connected layers, where the neurons in $L_i$ are fully connected to those in $L_j$, and let $(N_i, N_j) \in \mathbb{N}^{*2}$ denote their respective numbers of neurons. If $(y_{n_i}, w_{n_i n_j}) \in \mathbb{R}^2$ represent the output of the $n_i$-th neuron in layer $L_i$ and the connection weight between this neuron and the $n_j$-th neuron in layer $L_j$, the input value of the $n_j$-th neuron is calculated as follows:
$$x_{n_j} = \sum_{n_i = 1}^{N_i} w_{n_i n_j} \, y_{n_i},$$
where $n_j \in [1, \ldots, N_j]$.
Next, the resulting value is processed by a ReLU activation function. In each of the first two FC layers, half of the ReLU activation outputs are set to zero by dropout before being passed to the next FC layer. No dropout is applied in the last two layers, as the information in these layers is crucial for classification. Table 1 presents the initial parameter configuration for the proposed CNN architecture, where the bolded settings are fixed and the others are subject to a model selection procedure.
The proposed system minimizes the Cross-Entropy (CE) loss between the true class labels of the training images and their estimates provided by the CNN output layer. The optimization of the weight values is performed using backpropagation and an Adam optimizer with an initial learning rate of $10^{-4}$, which is decreased by a factor of 0.2 whenever the validation loss stops improving for more than 10 epochs (adaptive decay). During each training process, the training dataset is decomposed into batches of 32 images. For each epoch, the metrics (loss and accuracy) on the validation set are calculated, and the weights are saved if a lower value of the loss is obtained. Early stopping terminates the training if the validation loss does not decrease after a set number of epochs, fixed at 15 in this study.
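A condensed sketch of this training protocol in PyTorch is given below. The data loaders and model are placeholders, while the optimizer, scheduler, batching, checkpointing, and early-stopping settings mirror the values stated above.

```python
import copy
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

def train(model, train_loader, val_loader, max_epochs=200, device="cpu"):
    criterion = nn.CrossEntropyLoss()                        # CE loss on class labels
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = ReduceLROnPlateau(optimizer, factor=0.2, patience=10)  # adaptive decay
    best_loss, best_weights, patience = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:                            # batches of 32 images
            optimizer.zero_grad()
            loss = criterion(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()

        # validation metrics computed after each epoch
        model.eval()
        val_loss, correct, total = 0.0, 0, 0
        with torch.no_grad():
            for x, y in val_loader:
                out = model(x.to(device))
                val_loss += criterion(out, y.to(device)).item() * len(y)
                correct += (out.argmax(1).cpu() == y).sum().item()
                total += len(y)
        val_loss /= total
        scheduler.step(val_loss)

        if val_loss < best_loss:                             # keep the best weights
            best_loss, best_weights, patience = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            patience += 1
            if patience >= 15:                               # early stopping after 15 epochs
                break
    model.load_state_dict(best_weights)
    return model
```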

2.1.2. Model Selection for the CNN (Methodology)

This subsection discusses the model selection procedure for the proposed CNN architecture. Here, the parameters of Table 1 that are not fixed are varied, while the permanent (bolded) settings are kept unchanged.
We aim to find the best numbers and size of convolutional kernels, as well as the best number of neurons in FC layers resulting in the best possible performance on the validation set.
To ease this analysis, we consider architectures where the size of the convolutional kernels is the same for all CONV layers, i.e., $(U_1, U_2)_i = (U_1, U_2)$, $i \in [1, 2, 3]$. Squared kernels are tested with sizes ranging from (2,2) to (28,28) (Figure 1). The optimal parameters for $(U_1, U_2)$ are (4,4) for FUSAR-Ship, near (20,20) for OpenSARShip, and near (25,25) for MSTAR.
We then explore variations where the number of convolutional kernels in the last CONV layer is twice that of the first layer, i.e., $K_3 = 2K_1$, and where the number of kernels in the second CONV layer is set equal to that of one of the other layers, i.e., $K_2 = K_1$ or $K_2 = K_3$. High numbers of kernels ($K_i > 512$, $i \in [1, 2, 3]$) are not tested to avoid excessively long training times and to reduce the risk of overfitting (Table 2).
With the number of neurons in the FC layers defined by the parameter N, we explore different variations of this parameter (Table 3).
Carefully selecting the width of the CNN is crucial for achieving optimal performance. At this stage, the number and size of kernels are set to their optimal values. The output tensor from the final CONV layer has a flattened dimension of $N_f = 2 \times 2 \times K_3$. Given the decreasing number of neurons in the FC layers, the following conditions must be met: $N_1 < N_f$ and $N_{(n_{fc}-1)} > N_{out}$. These conditions restrict the parameter $N$ to the interval $[2 N_{out}, N_f]$. We then choose a set of values uniformly distributed over this interval to be evaluated as $N$.
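The following sketch illustrates this selection step under our reading of the constraints (in particular, the admissible interval is taken as [2N_out, N_f]); the evaluate function, which would train a CNN of width N and return its validation accuracy, is a placeholder.

```python
import numpy as np

def candidate_widths(K3, n_out, n_candidates=6):
    """Uniformly spaced candidates for N in the admissible interval.
    N_f is the flattened size of the last CONV output (2 x 2 x K3)."""
    n_f = 2 * 2 * K3
    lo, hi = 2 * n_out, n_f            # N/2 > N_out and N_1 < N_f (our reading)
    return np.linspace(lo, hi, n_candidates, dtype=int)

def select_width(K3, n_out, evaluate):
    """Score each candidate width on the validation set and keep the best one."""
    scores = {int(N): evaluate(int(N)) for N in candidate_widths(K3, n_out)}
    return max(scores, key=scores.get)

# e.g. select_width(K3=128, n_out=4, evaluate=lambda N: 0.0)  # evaluate() is hypothetical
```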
The optimal parameters for the proposed CNN model for each of the considered datasets are shown in Table 4. Note that better prediction performance is found for deeper datasets, e.g., OpenSARShip and MSTAR. These parameters are used in the CNN-LSTM hybrid network architecture (Section 3.3).

3. Recurrent Neural Networks (RNNs)

3.1. Description of RNN

An RNN is a DNN that possesses recurrent connections which give the ability to map an input sequence to an output sequence while at each step taking the information of previous steps into account.
Two major difficulties have been identified when training an RNN: the vanishing and exploding gradient problems [41]. When the gradient vanishes, the network basically stops learning, and when it explodes, it can cause weights to oscillate between different values [3]. These two phenomena have a similar origin. When applying backpropagation on DNNs, one must concatenate more and more multiplications of activations as it goes back in the network. These activations are bounded, in the case of the sigmoid function, between [0, 1], and this causes the gradient signal to vanish. This problem appears in every DNN, although simple solutions have been found for feed forward networks, such as the use of the ReLU [42] instead of the sigmoid function and the introduction of skip connections in so-called Residual Networks (ResNets) [35].

3.2. Long Short-Term Memory Network (LSTM)

Hochreiter and Schmidhuber [4] proposed a way around the vanishing/exploding gradient problems to allow RNNs to learn long-term dependencies by introducing a gating mechanism: the long short-term memory (LSTM) (cf. Figure 2).
The first step in an LSTM is to decide what information we are going to throw away from the cell state. This decision is made by a sigmoid layer called the forget gate layer. It looks at $h_{t-1}$ and $x_t$ and outputs a number between 0 (completely forget) and 1 (completely keep) for each number in the cell state $C_{t-1}$:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
The subsequent step involves determining the new information to be stored in the cell state. Initially, a sigmoid layer, known as the input gate layer, determines which values need updating. Following this, a tanh layer generates a vector of new potential candidate values, $\tilde{C}_t$, that might be incorporated into the state:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \qquad \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
Afterwards, the old cell state, $C_{t-1}$, is updated into the new cell state $C_t$. The old state is multiplied by $f_t$, and then the new scaled candidate values $i_t * \tilde{C}_t$ are added:
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$
Finally, the output will be a filtered version of the cell state. Initially, a sigmoid layer determines which portions of the cell state will be sent as output. Following this, the cell state undergoes a tanh transformation (to limit the values between −1 and 1) and is then multiplied by the output of the sigmoid gate, ensuring that only the chosen parts are output:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \qquad h_t = o_t * \tanh(C_t)$$
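For reference, the four relations above map directly onto one step of an LSTM cell; the didactic sketch below implements them explicitly (in practice, PyTorch's built-in nn.LSTM fuses these operations).

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step following the equations above. Each W_* has shape
    (hidden_size, hidden_size + input_size); each b_* has shape (hidden_size,)."""
    z = torch.cat([h_prev, x_t], dim=-1)          # [h_{t-1}, x_t]
    f_t = torch.sigmoid(z @ W_f.T + b_f)          # forget gate
    i_t = torch.sigmoid(z @ W_i.T + b_i)          # input gate
    c_tilde = torch.tanh(z @ W_C.T + b_C)         # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde            # cell state update
    o_t = torch.sigmoid(z @ W_o.T + b_o)          # output gate
    h_t = o_t * torch.tanh(c_t)                   # hidden state
    return h_t, c_t
```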

3.3. Combined CNN-LSTM Hybrid Network Adopted

3.3.1. Description of CNN-LSTM Architecture Adopted

In this work, a hybrid method was developed to classify ships using SAR images. The structure of this architecture is conceived by combining CNN and LSTM networks where a CNN is used to extract the complex features from input images and an LSTM is used as a classifier.
Figure 3 illustrates the proposed combined network for SAR image classification. Each CONV layer has the same steps described in Section 2, with the convolutional kernel applied across the superposed (overlapping) regions of the input in every convolution operation. In the last part of the architecture, the feature map is flattened into $K_3$ vectors of length $2 \times 2$ that are transferred to the LSTM layer to extract dependency information in terms of kernel ranking. This advanced RNN layer consists of a multiple-input multiple-output (MIMO) structure in which the flattened convolutional kernels with multiple ranks, analogous to time steps in the common use of LSTMs, are fed to the network to obtain multiple output features. The output of the LSTM layer is $K_3$ vectors of length $n_{hidden}$, where $n_{hidden}$ is the size of the hidden state, which is optimized during model selection since the performance of such a network depends on this hyperparameter. Dropout layers with a rate of 50% are applied to the outputs of the hidden layers, and then an FC layer with $N$ neurons connects the hidden state to the final FC layer with a softmax function. This final FC layer is used to predict the $N_{out}$ categories present in the given dataset.
The structure of the proposed architecture is shown in Table 5. Layers 1–6 of the network are convolutional layers, and layer 7 is the LSTM layer. After the CONV layers, the output shape is $(K_3, 2, 2)$ per image, so the input size of the LSTM layer is $(K_3, 4)$. After the LSTM layer, the architecture finally classifies SAR images through an FC layer and a softmax layer.
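The sketch below reflects our reading of Figure 3 and Table 5: the CONV output of shape (K3, 2, 2) is flattened into a sequence of K3 vectors of length 4, fed to an LSTM, and the dropout-regularized LSTM outputs drive the FC and softmax classifier. The layer sizes and the use of the full LSTM output sequence are assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Sketch of the combined network: CONV feature extractor -> sequence of
    K3 flattened kernel maps -> LSTM -> dropout -> FC(N) -> FC(n_classes)."""
    def __init__(self, conv_blocks, K3, n_hidden, N, n_classes):
        super().__init__()
        self.features = conv_blocks                      # e.g. an nn.Sequential of CONV blocks
        self.lstm = nn.LSTM(input_size=4, hidden_size=n_hidden, batch_first=True)
        self.drop = nn.Dropout(0.5)                      # 50% dropout on LSTM outputs
        self.fc = nn.Linear(K3 * n_hidden, N)
        self.out = nn.Linear(N, n_classes)               # softmax applied in the CE loss

    def forward(self, x):
        f = self.features(x)                             # (batch, K3, 2, 2)
        seq = f.flatten(start_dim=2)                     # (batch, K3, 4): K3 "time steps"
        h, _ = self.lstm(seq)                            # (batch, K3, n_hidden)
        h = self.drop(h).flatten(start_dim=1)            # (batch, K3 * n_hidden)
        return self.out(torch.relu(self.fc(h)))

# `conv_blocks` must produce a (K3, 2, 2) feature map, e.g. the tuned CONV stack of Table 9.
```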

3.3.2. Model Selection for the CNN-LSTM (Methodology)

To simplify the analysis, the convolutional kernels are assumed to have the same size across all CONV layers. Squared kernels with sizes ranging from (2,2) to (28,28) are tested (Figure 4). The optimal parameters for $(U_1, U_2)$ are (11,11) for FUSAR-Ship, close to (18,18) for OpenSARShip, and near (24,24) for MSTAR.
In Table 6, we present the accuracy of the model on the validation set for each dataset. The hyperparameters with the best accuracy are retained.
In Table 7, we varied the size of the hidden state ($n_{hidden}$) using the hyperparameters selected in the previous phase. Furthermore, in Table 8, we varied the number of neurons in the FC layer ($N$) using the hyperparameters selected in the previous phase.
The optimal parameters for the proposed CNN-LSTM network are shown in Table 9 for each dataset. It can be noticed that the optimal CNN-LSTM architectures result in slightly higher prediction accuracies than those of the optimal CNNs, yet this trend has to be verified on the testing dataset.

4. Brief Presentation of the SAR Datasets

Unlike optical images in computer vision, which can be easily collected and interpreted, SAR images are much more difficult to annotate due to their complex properties. Several publicly available datasets of SAR images were identified with which to conduct experiments and evaluate ship classification using the proposed architecture. In this section, we briefly present the applied datasets we chose to work on, MSTAR, OpenSARShip, and FUSAR-Ship.

4.1. MSTAR Data and Pre-Processing

The MSTAR (Moving and Stationary Target Acquisition and Recognition) database (https://www.sdms.afrl.af.mil/index.php?collection=mstar, accessed on 10 July 2022) [44] contains a set of images collected in 1996 in the X band (8–12 GHz) with HH polarization and a resolution of 30 cm. The acquisition mode used is Spotlight, which allows better resolution because the airborne radar remains directed towards the target during its movement [45].
Figure 5 presents examples of SAR images of the different targets. The pixel values of the MSTAR data lie on a scale of 0 to 255; therefore, we convert them to floating-point numbers and normalize the values by applying a factor of $\frac{1}{255}$. The MSTAR images are of size 128 × 128 pixels. The database is provided with a predefined split into a test set and an entire training set. For the training process, we randomly split the entire training set into training and validation subsets with ratios of 80% and 20%, respectively. The validation set allows the monitoring of the quality of training and thus serves as a good indicator for hyperparameter tuning during model selection.
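A brief sketch of this pre-processing, assuming the MSTAR chips have already been loaded as uint8 NumPy arrays with integer labels (the loading step is format-specific and omitted here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def prepare_mstar(images_uint8, labels, val_ratio=0.2, seed=0):
    """Scale 0-255 pixel values to floats in [0, 1] (factor 1/255) and randomly
    split the entire training set into 80% training / 20% validation subsets."""
    x = images_uint8.astype(np.float32) / 255.0
    # returns x_train, x_val, y_train, y_val
    return train_test_split(x, labels, test_size=val_ratio, random_state=seed)
```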
The MSTAR database is a publicly available dataset of synthetic aperture radar (SAR) images. This benchmark is used for automatic target recognition (ATR) tasks. It consists of 5165 images of 10 classes that correspond to different ground targets, such as trucks, tanks, and cars. The numbers of images per class for entire training and test sets are summarized in Table 10.

4.2. OpenSARShip Data and Pre-Processing

The SAR image database OpenSARShip [46] is used in the recent scientific literature for the evaluation of SAR image classification algorithms [33,47]. This dataset is composed of SAR image chips of ships, extracted from images produced by Sentinel-1 satellites. These chips are derived from two types of products: SLC and GRD. For each of these products, VV and VH polarizations are provided, in amplitude only for GRD and as complex data for SLC. The corresponding files are provided in different forms: original data, calibrated data, pseudo-color visualization, and grayscale visualization.
The characteristics of the objects in this dataset are also very variable. In particular, the dataset presents a very variable number of instances per class, as well as a great variability of the dimensions and the resolution of the images.
The dataset includes 5673 objects in 68 different classes. The classes correspond to the “Elaborated_type” characteristic present in the metadata of each object provided in the OpenSARShip dataset. Note that several classes have a very small number of instances, which prevents deep learning from extracting discriminative information that is useful in the generalization phase. To overcome this problem, we retained only the most represented classes of this dataset. This choice is generally noted and adopted in the literature: the authors in [33,47] retained only three classes, respectively {Bulk Carrier, Container Ship, Tanker} and {Cargo, Bulk Carrier, Container Ship}. In this study, we retained three classes: {Cargo, Bulk Carrier, Container Ship}.
The SAR images in this dataset also vary greatly in size and resolution, and this difference can greatly complicate the learning task of the neural network (Figure 6). Indeed, objects of the same class but with very different resolutions present different characteristics. It is then more complex for a neural network to extract discriminating characteristics of a class compared to the others. For this reason, the authors in [47] retained images with a minimum size of 70 × 70 pixels to ensure a minimum resolution when using them. So, we also selected and retained only the targets with the following characteristics:
  • Product type: GRD.
  • Polarization: VV.
  • Image size: larger than 70 × 70 pixels; the images are then resized to 128 × 128 pixels.
  • Class: {Cargo, Bulk Carrier, Container Ship}.
Figure 6. Examples of objects (a) from the OpenSARShip dataset and (b) from the selected OpenSARShip subpart (VV amplitude).
In order to perform training, we needed to batch the images, which requires images of the same size. We therefore resized the images to 128 × 128 pixels: each image was either cropped while keeping the central part or filled with null pixels. The pixel values were therefore not changed, since no interpolation was performed.
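A minimal sketch of this crop-or-pad operation, assuming single-channel NumPy arrays (centering the padding is our assumption; the text only states that null pixels are used):

```python
import numpy as np

def crop_or_pad(img, size=128):
    """Center-crop dimensions larger than the target and zero-pad smaller ones
    to size x size. No interpolation is performed, so pixel values are unchanged."""
    h, w = img.shape
    # center-crop any dimension larger than the target
    top = max((h - size) // 2, 0)
    left = max((w - size) // 2, 0)
    cropped = img[top:top + min(h, size), left:left + min(w, size)]
    # zero-pad any dimension smaller than the target
    out = np.zeros((size, size), dtype=img.dtype)
    ch, cw = cropped.shape
    oy, ox = (size - ch) // 2, (size - cw) // 2
    out[oy:oy + ch, ox:ox + cw] = cropped
    return out

print(crop_or_pad(np.ones((70, 200), dtype=np.float32)).shape)  # (128, 128)
```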
For the evaluation of the deep learning algorithms, the data were divided into three subparts: a training set, a validation set, and a test set. We thus obtained the class distribution over the four subsets given in Table 11. The training set was used to train the deep learning algorithms. The validation set allowed the monitoring of the quality of the training process and thus served as an indicator for hyperparameter tuning. The entire training set, which contains both the training and validation sets, was used to re-train the model with the optimal hyperparameters. The test set allowed a prediction performance evaluation independent of the training process and thus allowed us to measure the generalization capability of the trained algorithm.

4.3. FUSAR-Ship Data and Pre-Processing

FUSAR-Ship is an open SAR-AIS matchup dataset derived from the Gaofen-3 satellite, backed by the Key Laboratory for Information Science of Electromagnetic Waves (MoE) at Fudan University. Gaofen-3 (GF-3) serves as China’s inaugural civil C-Band fully polarimetric spaceborne synthetic aperture radar (SAR), mainly tasked with oceanic remote sensing and marine monitoring. The FUSAR-Ship dataset was assembled using an automatic SAR-AIS matchup procedure applied to over 100 GF-3 scenes, encompassing a wide range of sea, land, coastal, river, and island environments (Figure 7). It comprises more than 5000 ship image chips with corresponding AIS messages (AIS: automatic identification system) and includes various other maritime targets and background clutter. FUSAR-Ship is designed as a public benchmark dataset for the detection and recognition of ships and marine targets [48].
All image chips are 512 × 512 pixels and are extracted from the original GF-3 L1A images. The ship is consistently positioned at the center, though the chip might also contain adjacent ships or other items.
In this study, we chose a subset that contains four common ship classes, Cargo, Bulk Carrier, Fishing, and Tanker, from the original dataset. For the evaluation of deep learning algorithms, we proceeded with the same process presented for the two previous datasets. The data were divided into three subparts: a training set, a validation set, and a test set. We thus obtained the class distribution in the four subsets given by Table 12.

5. Experiments and Results on SAR Images

The proposed solution was implemented in Python 3.7 using the open-source PyTorch library. Training was performed on an NVIDIA GPU (driver version 440.82, as reported by nvidia-smi) with CUDA toolkit 10.1.
Figure 8 and Figure 9 show a comparison between the training plots of the proposed networks for each of the FUSAR-Ship, OpenSARShip, and MSTAR datasets. The evolution of CE loss during training is shown in Figure 8, while the evolution of classification accuracy is shown in Figure 9. We can see that for the case of the OpenSARShip dataset, the combined network is more stable while learning and converges faster. In contrast, an opposite trend is observed for the deeper MSTAR dataset. In this case, the proposed CNN converges faster, but it has a stability that is similar to that of the combined network.
By analyzing the results, it is demonstrated that a combination of CNN and LSTM has significant effects on the classification of ships based on the automatic extraction of features from SAR images.
Figure 10, Figure 11 and Figure 12 depict the normalized confusion matrices of the proposed DNNs for ship classification on the FUSAR-Ship, OpenSARShip, and MSTAR testing sets, respectively.
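Such per-class normalized confusion matrices can be reproduced with scikit-learn; a minimal sketch, assuming y_true and y_pred hold the test labels and the network predictions (the class names and dummy labels below are placeholders):

```python
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def plot_normalized_cm(y_true, y_pred, class_names):
    """Confusion matrix normalized over true labels (each row sums to 1),
    so the diagonal shows per-class recall."""
    cm = confusion_matrix(y_true, y_pred, normalize="true")
    ConfusionMatrixDisplay(cm, display_labels=class_names).plot(values_format=".2f")
    return cm

# dummy example with three classes
print(plot_normalized_cm([0, 0, 1, 2], [0, 1, 1, 2],
                         ["Cargo", "Bulk Carrier", "Container Ship"]))
```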
For the OpenSARShip test set, the proposed methods have a relatively good classification performance for the class “Bulk Carrier”, which is the most represented class in the test set, i.e., 62.13% of the testing samples. For both networks, there are still cases of misclassification for the least represented classes. It can be seen that for the CNN, 20 of the “Cargo” ships were misjudged as “Bulk Carrier”, with an inter-class error of 64.5%, and 13 of the “Container Ship” images were misjudged as “Bulk Carrier”, with an inter-class error of 39.4%, indicating that both “Cargo” and “Container Ship” are easily confused with “Bulk Carrier”. This is mainly due to the imbalance between classes in terms of the number of samples and the low number of training images. Overall, the main difference between the two networks is that for the hybrid network, the inter-class error is null between the least represented classes, that is, “Cargo” and “Container Ship”, in contrast to 3.2% for the proposed CNN.
Regarding the MSTAR dataset, both proposed architectures classify test images with high accuracy. Among the ten classes, four are perfectly classified, and, respectively, six and seven classes reach a classification accuracy higher than 99% for the CNN and the CNN-LSTM. For both networks, the class with the lowest classification accuracy is “BMP2”. Regarding the prediction of images belonging to this class, the proposed CNN-LSTM network slightly outperforms the competitive CNN network, as it has a lower confusion ratio between “BMP2” and “T72”, i.e., 6% for the combined network in contrast to 9% for the CNN. Hence, the proposed CNN-LSTM system can efficiently classify target images on a deeper and more balanced dataset.
The comparative study presented in Table 13, Table 14 and Table 15 evaluates the performance of the proposed CNN and CNN-LSTM networks against existing state-of-the-art architectures. This analysis reveals the strengths and limitations of the proposed methods, offering a balanced perspective on their application.
One significant advantage of the proposed methods is their efficiency in training time. The CNN-LSTM network substantially reduces training time compared to both the standalone CNN and other architectures. On the OpenSARShip dataset (Table 14), the CNN-LSTM network completes training in just 132.61 s, whereas VGG16 requires 718.57 s. Such efficiency makes the CNN-LSTM particularly suitable for time-sensitive applications and scenarios with computational constraints. Additionally, the CNN-LSTM achieves a marked reduction in test loss, improving probabilistic class predictions. On the OpenSARShip and MSTAR datasets, the test loss is reduced by 63.93% and 38.78%, respectively, when compared to the standalone CNN (Table 14 and Table 15).
The CNN-LSTM also delivers competitive accuracy, outperforming most existing architectures. For instance, on the OpenSARShip dataset, the CNN-LSTM achieves an accuracy of 70.41%, which is higher than ResNet50 (57.99%) and Xception (65.09%) (Table 14). Furthermore, the proposed methods demonstrate strong adaptability to datasets with fewer and unbalanced instances, such as OpenSARShip, highlighting their robustness in challenging scenarios. The lightweight design of the standalone CNN, with a parameter count as low as 363 k on FUSAR-Ship, adds to its appeal for resource-constrained environments (Table 13).
However, the proposed methods have some limitations. While the CNN-LSTM offers significant efficiency gains, its accuracy improvements over top-performing architectures like VGG16 are modest. On the MSTAR dataset, the CNN-LSTM achieves 98.35% accuracy, only slightly higher than VGG16’s 98.14%. Additionally, both the CNN and CNN-LSTM networks exhibit higher test loss on certain datasets, such as FUSAR-Ship, where their test loss (4.8501 and 4.1756, respectively) surpasses that of DenseNet121 (3.4620) and Xception (2.8643) (Table 13).
Another drawback is the increased parameter count of the CNN-LSTM compared to the standalone CNN. On the MSTAR dataset, the CNN-LSTM’s parameters reach 59.33 M, significantly more than the standalone CNN’s 31.10 M, potentially increasing memory requirements (Table 15). Lastly, while the CNN-LSTM performs well on datasets like OpenSARShip and MSTAR, its variable performance across datasets, such as the higher test loss on FUSAR-Ship, suggests it may not universally outperform existing architectures (Table 13).
In summary, the proposed CNN and CNN-LSTM networks offer significant advantages in efficiency and robustness, particularly for datasets with unbalanced classes or in resource-constrained environments. However, their marginal accuracy improvements and variability in performance across datasets indicate that further refinements may be necessary to achieve consistent superiority over state-of-the-art architectures. These methods remain a promising step forward in designing efficient and adaptable neural networks.

6. Conclusions

This study addresses the challenges of using deep neural networks (DNNs) for target classification in synthetic aperture radar (SAR) images, particularly with limited labeled data. By balancing complexity and performance, we proposed a CNN architecture and evaluated it on the OpenSARShip, MSTAR, and FUSAR-Ship datasets, focusing on supervised learning with minimal annotated data. Key hyperparameters were optimized through a validation-based model selection process, ensuring robust generalization. We further proposed replacing dense layers with LSTM layers for convolutional feature classification. This combination enhanced both classification accuracy and training efficiency. Comparative analysis demonstrated that our model offers competitive performance while maintaining lower computational costs compared to state-of-the-art architectures commonly used in optical image processing.
Looking ahead, future studies could explore integrating shallow DNNs with attention mechanisms [26] to focus selectively on relevant image regions, thereby improving classification accuracy. Additionally, including advanced techniques such as dilated convolutions—including Hybrid Dilated CNNs (HDCs) [49], mixed convolutional kernels [11], and multi-dilated convolutional blocks—could further enhance feature extraction and reduce reliance on pooling operations [50].
Unsupervised learning represents another promising avenue, enabling the use of large unannotated datasets to independently learn essential image characteristics. Expanding annotated datasets through Generative Adversarial Networks (GANs) or other data augmentation methods could also significantly improve performance. These perspectives hold potential for advancing ship classification and other remote sensing applications.

Author Contributions

Methodology, A.T., J.-C.C., M.A. and A.K.; software, A.T. and M.A.; validation, A.T., M.A. and J.-C.C.; formal analysis, A.T., J.-C.C. and A.K.; supervision, A.T. and A.K.; project administration, A.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the AID-DGA (Agence de l’innovation de défense, Direction Générale de l’Armement France), who supported this research.

Data Availability Statement

We used in this study three publicly available datasets: MSTAR, OpenSARShip, and FUSAR-Ship. The MSTAR dataset can be found at https://www.sdms.afrl.af.mil/index.php?collection=mstar (accessed on 3 July 2022). The OpenSARShip dataset is available at https://opensar.sjtu.edu.cn/DataAndCodes.html (accessed on 3 July 2022). The FUSAR-ship dataset can be downloaded at https://drive.usercontent.google.com/download?id=1SOEMud9oUq69gxbfcBkOvtUkZ3LWEpZJ&export=download&authuser=0, last accessed on 6 December 2024.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
CNN: Convolutional neural network
LSTM: Long short-term memory
MS-CNN: Multi-Stream CNN
Conv-BiLSTM: Convolutional bidirectional long short-term memory
CBLPN: Conv-BiLSTM Prototypical Network
SAR: Synthetic aperture radar
MSTAR: Moving and Stationary Target Acquisition and Recognition
AIS: Automatic identification system
DNN: Deep neural network
FC: Fully connected
CE: Cross-Entropy
GPU: Graphics processing unit
FPGA: Field-Programmable Gate Array
ASIC: Application-Specific Integrated Circuit
RNN: Recurrent neural network
GRD: Ground Range Detected
SLC: Single Look Complex
VV: Vertical–Vertical polarization
VH: Vertical–Horizontal polarization
HDC: Hybrid Dilated CNN
GAN: Generative Adversarial Network
ReLU: Rectified Linear Unit
MIMO: Multiple input multiple output
IT: Inferotemporal Cortex
RGC: Retinal Ganglion Cells
LGN: Lateral Geniculate Nucleus
IoT: Internet of Things
CSI: Channel State Information
RF: Radio Frequency
SSID: Service Set Identifier
BPTT: Backpropagation Through Time
HOG: Histogram of Oriented Gradients
GRSS: Geoscience and Remote Sensing Society
ATR: Automatic target recognition
CUDA: Compute Unified Device Architecture
SMI: System Management Interface
VGG: Visual Geometry Group
ResNet: Residual Network
Xception: Extreme Inception
DenseNet: Densely Connected Convolutional Networks
EfficientNet: Efficient Network
MobileNet: Mobile Network
FUSAR-Ship: Fudan University SAR-Ship
HH: Horizontal–Horizontal polarization

References

  1. Zhang, X.; Ramachandran, A.; Zhuge, C.; He, D.; Zuo, W.; Cheng, Z.; Rupnow, K.; Chen, D. Machine learning on FPGAs to face the IoT revolution. In Proceedings of the 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Irvine, CA, USA, 13–16 November 2017; pp. 819–826. [Google Scholar] [CrossRef]
  2. Andri, R.; Cavigelli, L.; Rossi, D.; Benini, L. YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights. In Proceedings of the 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Pittsburgh, PA, USA, 11–13 July 2016; pp. 236–241. [Google Scholar] [CrossRef]
  3. Hochreiter, S.; Bengio, Y.; Frasconi, P.; Schmidhuber, J. Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Neural Networks; Kremer, S.C., Kolen, J.F., Eds.; IEEE Press: Piscataway, NJ, USA, 2001. [Google Scholar]
  4. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  5. Islam, M.Z.; Islam, M.M.F.; Asraf, A. A Combined Deep CNN-LSTM Network for the Detection of Novel Coronavirus (COVID-19) Using X-ray Images. medRxiv 2020, 20, 100412. [Google Scholar] [CrossRef] [PubMed]
  6. Lu, C.; Li, W. Ship Classification in High-Resolution SAR Images via Transfer Learning with Small Training Dataset. Sensors 2019, 19, 63. [Google Scholar] [CrossRef] [PubMed]
  7. Li, J.; Qu, C.; Peng, S. Ship classification for unbalanced SAR dataset based on convolutional neural network. J. Appl. Remote Sens. 2018, 12, 035010. [Google Scholar] [CrossRef]
  8. Xiao, Q.; Liu, B.; Li, Z.; Ni, W.; Yang, Z.; Li, L. Progressive Data Augmentation Method for Remote Sensing Ship Image Classification Based on Imaging Simulation System and Neural Style Transfer. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 9176–9186. [Google Scholar] [CrossRef]
  9. Geng, Z.; Xu, Y.; Wang, B.N.; Yu, X.; Zhu, D.Y.; Zhang, G. Target Recognition in SAR Images by Deep Learning with Training Data Augmentation. Sensors 2023, 23, 941. [Google Scholar] [CrossRef] [PubMed]
  10. Lang, H.; Yang, G.; Li, C.; Xu, J. Multisource Heterogeneous Transfer Learning via Feature Augmentation for Ship Classification in SAR Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  11. Zhang, W.; Zhu, Y.; Fu, Q. Semi-Supervised Deep Transfer Learning-Based on Adversarial Feature Learning for Label Limited SAR Target Recognition. IEEE Access 2019, 7, 152412–152420. [Google Scholar] [CrossRef]
  12. Zaied, S.; Toumi, A.; Khenchaf, A. Target classification using convolutional deep learning and auto-encoder models. In Proceedings of the 2018 4th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), Sousse, Tunisia, 21–24 March 2018; pp. 1–6. [Google Scholar] [CrossRef]
  13. Wang, Q.; Teng, Z.; Xing, J.; Gao, J.; Hu, W.; Maybank, S. Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4854–4863. [Google Scholar] [CrossRef]
  14. Zhang, T.; Zhang, X. Squeeze-and-Excitation Laplacian Pyramid Network With Dual-Polarization Feature Fusion for Ship Classification in SAR Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  15. Zhang, T.; Zhang, X.; Ke, X.; Liu, C.; Xu, X.; Zhan, X.; Wang, C.; Ahmad, I.; Zhou, Y.; Pan, D.; et al. HOG-ShipCLSNet: A Novel Deep Learning Network With HOG Feature Fusion for SAR Ship Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–22. [Google Scholar] [CrossRef]
  16. Khenchaf, Y.; Toumi, A. Siamese Neural Network for Automatic Target Recognition Using Synthetic Aperture Radar. In Proceedings of the International Geoscience and Remote Sensing Symposium (IGARSS), Pasadena, CA, USA, 16–21 July 2023. [Google Scholar]
  17. Zhang, T.; Zhang, X. Injection of Traditional Hand-Crafted Features into Modern CNN-Based Models for SAR Ship Classification: What, Why, Where, and How. Remote Sens. 2021, 13, 2091. [Google Scholar] [CrossRef]
  18. Toumi, A.; Cexus, J.C.; Khenchaf, A. A proposal learning strategy on CNN architectures for targets classification. In Proceedings of the 2022 6th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), Sfax, Tunisia, 24–27 May 2022; pp. 1–6. [Google Scholar] [CrossRef]
  19. Raj, J.A.; Idicula, S.M.; Paul, B. One-Shot Learning-Based SAR Ship Classification Using New Hybrid Siamese Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  20. Lang, H.; Wang, R.; Zheng, S.; Wu, S.; Li, J. Ship Classification in SAR Imagery by Shallow CNN Pre-Trained on Task-Specific Dataset with Feature Refinement. Remote Sens. 2022, 14, 5986. [Google Scholar] [CrossRef]
  21. Li, J.; Xu, C.; Su, H.; Gao, L.; Wang, T. Deep Learning for SAR Ship Detection: Past, Present and Future. Remote Sens. 2022, 14, 2712. [Google Scholar] [CrossRef]
22. Er, M.; Zhang, Y.; Chen, J.; Gao, W. Ship detection with deep learning: A survey. Artif. Intell. Rev. 2023, 56, 1–41.
23. Zhao, P.; Liu, K.; Zou, H.; Zhen, X. Multi-Stream Convolutional Neural Network for SAR Automatic Target Recognition. Remote Sens. 2018, 10, 1473.
24. Wang, L.; Bai, X.; Zhou, F. Few-Shot SAR ATR Based on Conv-BiLSTM Prototypical Networks. In Proceedings of the 2019 6th Asia-Pacific Conference on Synthetic Aperture Radar (APSAR), Xiamen, China, 26–29 November 2019; pp. 1–5.
25. Baird, Z.; Mcdonald, M.K.; Rajan, S.; Lee, S.J. A CNN-LSTM Network for Augmenting Target Detection in Real Maritime Wide Area Surveillance Radar Data. IEEE Access 2020, 8, 179281–179294.
26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
27. Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. RSMamba: Remote Sensing Image Classification with State Space Model. arXiv 2024, arXiv:2403.19654.
28. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752.
29. Yang, G.; Lang, H. Semisupervised Heterogeneous Domain Adaptation via Dynamic Joint Correlation Alignment Network for Ship Classification in SAR Imagery. IEEE Geosci. Remote Sens. Lett. 2022, 19, 3175056.
30. Ma, L.; Tian, S. A Hybrid CNN-LSTM Model for Aircraft 4D Trajectory Prediction. IEEE Access 2020, 8, 134668–134680.
31. Alsanwy, S.; Asadi, H.; Qazani, M.R.C.; Mohamed, S.; Nahavandi, S. A CNN-LSTM Based Model to Predict Trajectory of Human-Driven Vehicle. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, HI, USA, 1–4 October 2023; pp. 3097–3103.
32. Wang, M.; Guo, X.; She, Y.; Zhou, Y.; Liang, M.; Chen, Z.S. Advancements in Deep Learning Techniques for Time Series Forecasting in Maritime Applications: A Comprehensive Review. Information 2024, 15, 507.
33. Zhang, D.; Liu, J.; Heng, W.; Ren, K.; Song, J. Transfer Learning with Convolutional Neural Networks for SAR Ship Recognition. IOP Conf. Ser. Mater. Sci. Eng. 2018, 322, 072001.
34. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015; Conference Track Proceedings; Bengio, Y., LeCun, Y., Eds.; DBLP: Trier, Germany, 2015.
35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
36. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
37. Huang, G.; Liu, Z.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
38. Sandler, M.; Howard, A.G.; Zhu, M.; Zhmoginov, A.; Chen, L. Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
39. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019.
40. Basha, S.S.; Dubey, S.R.; Pulabaigari, V.; Mukherjee, S. Impact of fully connected layers on performance of convolutional neural networks for image classification. Neurocomputing 2020, 378, 112–119.
41. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166.
42. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; Proceedings of Machine Learning Research; Gordon, G., Dunson, D., Dudík, M., Eds.; AISTATS: Amherst, MA, USA, 2011; Volume 15, pp. 315–323.
43. Olah, C. Understanding LSTM Networks. 2015. Available online: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ (accessed on 12 March 2022).
44. Keydel, E.R.; Lee, S.W.; Moore, J.T. MSTAR extended operating conditions: A tutorial. In Proceedings of the Algorithms for Synthetic Aperture Radar Imagery III, Orlando, FL, USA, 8–12 April 1996; Zelnio, E.G., Douglass, R.J., Eds.; International Society for Optics and Photonics—SPIE: Bellingham, WA, USA, 1996; Volume 2757, pp. 228–242.
45. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning Augmentation Policies from Data. arXiv 2019, arXiv:1805.09501.
46. Huang, L.; Liu, B.; Li, B.; Guo, W.; Yu, W.; Zhang, Z.; Yu, W. OpenSARShip: A Dataset Dedicated to Sentinel-1 Ship Interpretation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 195–208.
47. Huang, Z.; Pan, Z.; Lei, B. What, Where, and How to Transfer in SAR Target Recognition Based on Deep CNNs. IEEE Trans. Geosci. Remote Sens. 2020, 58, 2324–2336.
48. Hou, X.; Ao, W.; Song, Q.; Lai, J.; Wang, H.; Xu, F. FUSAR-Ship: Building a high-resolution SAR-AIS matchup dataset of Gaofen-3 for ship detection and recognition. Sci. China Inf. Sci. 2020, 63, 1–19.
49. Lei, X.; Pan, H.; Huang, X. A Dilated CNN Model for Image Classification. IEEE Access 2019, 7, 124087–124095.
50. Li, Y.; Zhang, X.; Chen, D. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes. arXiv 2018, arXiv:1802.10062.
Figure 1. Classification performance of the CNN on the validation set as a function of the square convolutional kernel size (U_1, U_2), with U_1 = U_2, (K_1, K_2, K_3) = (64, 64, 128), and N = 256.
Figure 2. The repeating LSTM module [43].
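For reference, the gate operations inside the repeating module of Figure 2 correspond to the standard LSTM formulation (notation as in [43]), in which the forget, input, and output gates control how the cell state C_t and hidden state h_t are updated at each step t:

$$
\begin{aligned}
f_t &= \sigma\big(W_f\,[h_{t-1}, x_t] + b_f\big), & i_t &= \sigma\big(W_i\,[h_{t-1}, x_t] + b_i\big),\\
\tilde{C}_t &= \tanh\big(W_C\,[h_{t-1}, x_t] + b_C\big), & C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t,\\
o_t &= \sigma\big(W_o\,[h_{t-1}, x_t] + b_o\big), & h_t &= o_t \odot \tanh(C_t),
\end{aligned}
$$

where σ is the logistic sigmoid, ⊙ the element-wise product, x_t the input at step t, and [h_{t-1}, x_t] the concatenation of the previous hidden state and the current input.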
Figure 3. Illustration of the CNN-LSTM network for SAR image classification.
Figure 4. Classification performance of the CNN-LSTM on the validation set as a function of the square convolutional kernel size (U_1, U_2), with U_1 = U_2, (K_1, K_2, K_3) = (64, 64, 128), n_hidden = 128, and N = 128.
Figure 5. Example images of the ten target classes in the MSTAR dataset. (a) 2S1; (b) BMP2; (c) BRDM2; (d) BTR60; (e) BTR70; (f) D7; (g) T62; (h) T72; (i) ZIL131; (j) ZSU234.
Figure 7. Different categories of ships in FUSAR-Ship [48].
Figure 8. Evolution of the CE loss during training for (a) FUSAR-Ship, (b) OpenSARShip, and (c) MSTAR datasets.
Figure 9. Evolution of classification accuracy during training for (a) FUSAR-Ship, (b) OpenSARShip, and (c) MSTAR datasets.
Figure 10. Normalized confusion matrices of the proposed (a) CNN and (b) CNN-LSTM architectures on the FUSAR-Ship dataset.
Figure 11. Normalized confusion matrices of the proposed (a) CNN and (b) CNN-LSTM architectures on the OpenSARShip dataset.
Figure 12. Normalized confusion matrices of the proposed (a) CNN and (b) CNN-LSTM architectures on the MSTAR dataset.
Table 1. Proposed CNN architecture for ship classification.

| CNN Layer | Layer Steps | Parameters |
| --- | --- | --- |
| Input | – | R ∈ ℝ^(N_R × N_s × N_s) |
| CONV #1 | Zero padding 2D | (Z_1, Z_2)_1 = (1, 1) |
|  | Conv 2D | K_1 = 64, (U_1, U_2)_1 = (2, 2), (a_1, a_2)_1 = (1, 1), act = 'ReLU' |
|  | Max-pooling 2D | (V_1, V_2)_1 = (4, 4), (b_1, b_2)_1 = (4, 4) |
|  | Dropout | 25% |
| CONV #2 | Zero padding 2D | (Z_1, Z_2)_2 = (1, 1) |
|  | Conv 2D | K_2 = 64, (U_1, U_2)_2 = (2, 2), (a_1, a_2)_2 = (1, 1), act = 'ReLU' |
|  | Max-pooling 2D | (V_1, V_2)_2 = (4, 4), (b_1, b_2)_2 = (4, 4) |
|  | Dropout | 25% |
| CONV #3 | Zero padding 2D | (Z_1, Z_2)_3 = (1, 1) |
|  | Conv 2D | K_3 = 128, (U_1, U_2)_3 = (2, 2), (a_1, a_2)_3 = (1, 1), act = 'ReLU' |
|  | Max-pooling 2D | (V_1, V_2)_3 = (4, 4), (b_1, b_2)_3 = (4, 4) |
|  | Dropout | 25% |
| FC #1 | Dense | N_1 = N, act = 'ReLU' |
|  | Dropout | 50% |
| FC #2 | Dense | N_2 = N/2, act = 'ReLU' |
|  | Dropout | 50% |
| Output | Dense | N_3 = N_out = C, act = 'identity' |
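To make the layer stack in Table 1 concrete, the following is a minimal Keras-style sketch, assuming a 128 × 128 single-channel SAR chip and C = num_classes output classes; the builder name, argument names, and default values are illustrative (the table defaults are used here, and the per-dataset optima of Table 4 can be passed in instead), not the original implementation.

```python
# Hedged sketch of the shallow CNN of Table 1 (illustrative names/defaults).
import tensorflow as tf
from tensorflow.keras import layers, models


def build_shallow_cnn(input_shape=(128, 128, 1), num_classes=3,
                      kernels=(64, 64, 128), kernel_size=(2, 2), n_fc=256):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for k in kernels:                                   # CONV #1 ... CONV #3
        x = layers.ZeroPadding2D(padding=(1, 1))(x)     # (Z1, Z2) = (1, 1)
        x = layers.Conv2D(k, kernel_size, strides=(1, 1),
                          activation='relu')(x)         # (a1, a2) = (1, 1)
        x = layers.MaxPooling2D(pool_size=(4, 4), strides=(4, 4))(x)
        x = layers.Dropout(0.25)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(n_fc, activation='relu')(x)        # FC #1: N neurons
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(n_fc // 2, activation='relu')(x)   # FC #2: N/2 neurons
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes)(x)              # linear ('identity') output
    return models.Model(inputs, outputs)


model = build_shallow_cnn()
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
```

Because the output layer is linear ('identity'), the cross-entropy loss is computed from logits in this sketch.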
Table 2. Validation performance of the CNN for variations in the numbers of kernels in the CONV layers with optimized values of (U_1, U_2) and N = 256.

| (K_1, K_2, K_3) | FUSAR-Ship | OpenSARShip | MSTAR |
| --- | --- | --- | --- |
| (32, 32, 64) | 64.95 | 73.04 | 97.55 |
| (32, 64, 64) | 65.03 | 73.19 | 97.92 |
| (64, 64, 128) | **65.94** | 74.81 | 97.96 |
| (64, 128, 128) | 65.51 | **75.56** | 98.65 |
| (128, 128, 256) | 65.03 | 74.07 | 98.72 |
| (128, 256, 256) | 63.57 | 73.93 | 98.69 |
| (256, 256, 512) | 64.34 | 74.07 | 98.65 |
| (256, 512, 512) | 63.61 | 74.22 | **99.12** |

Values are validation accuracies (%); bold indicates the highest validation accuracy achieved for each dataset.
Table 3. Validation performance of the CNN for variations in FC layer neurons with the optimal values of (U_1, U_2) and (K_1, K_2, K_3).

| N | FUSAR-Ship | OpenSARShip | MSTAR |
| --- | --- | --- | --- |
| 128 | 64.60 | 73.48 | 98.91 |
| 256 | **65.94** | **75.56** | **99.12** |
| 384 | 64.82 | 75.41 | 98.83 |
| 512 | 64.34 | 74.22 | 98.69 |

Values are validation accuracies (%); bold indicates the highest validation accuracy achieved for each dataset.
Table 4. Optimal parameters of the proposed CNN for each dataset and validation performances of the optimal architectures.

| Dataset | (U_1, U_2) | (K_1, K_2, K_3) | N | Validation Accuracy (%) |
| --- | --- | --- | --- | --- |
| FUSAR-Ship | (4, 4) | (64, 64, 128) | 256 | 65.94 |
| OpenSARShip | (20, 20) | (64, 128, 128) | 256 | 75.56 |
| MSTAR | (25, 25) | (128, 128, 256) | 256 | 99.12 |
Table 5. The full summary of the proposed CNN-LSTM hybrid network.

| Layer | Type | Kernel | Kernel Size | Stride | Input Size |
| --- | --- | --- | --- | --- | --- |
| 1 | Convolution 2D | K_1 | U_1 × U_2 | 1 | 1 × 128 × 128 |
| 2 | Pool | – | 4 × 4 | 4 | K_1 × 128 × 128 |
| 3 | Convolution 2D | K_2 | U_1 × U_2 | 1 | K_1 × 32 × 32 |
| 4 | Pool | – | 4 × 4 | 4 | K_2 × 32 × 32 |
| 5 | Convolution 2D | K_3 | U_1 × U_2 | 1 | K_2 × 8 × 8 |
| 6 | Pool | – | 4 × 4 | 4 | K_3 × 8 × 8 |
| 7 | LSTM | – | – | – | K_3 × 4 |
| 8 | FC | N | – | – | K_3 · n_hidden |
| 9 | Softmax | N_out | – | – | N |
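For illustration, a hedged Keras-style sketch of the CNN-LSTM of Table 5 and Figure 3 follows. It assumes 'same' convolutions so that the feature-map sizes match the table, and it treats the K_3 feature maps at the last pooling output (2 × 2 spatial positions) as a sequence of K_3 steps with 4 features each; function and argument names are illustrative, not taken from the original implementation.

```python
# Hedged sketch of the CNN-LSTM of Table 5 / Figure 3 (illustrative names/defaults).
from tensorflow.keras import layers, models


def build_cnn_lstm(input_shape=(128, 128, 1), num_classes=3,
                   kernels=(64, 64, 128), kernel_size=(11, 11),
                   n_hidden=128, n_fc=128):
    k1, k2, k3 = kernels
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for k in (k1, k2, k3):                         # layers 1-6: conv + 4x4 pooling
        x = layers.Conv2D(k, kernel_size, strides=1, padding='same',
                          activation='relu')(x)
        x = layers.MaxPooling2D(pool_size=(4, 4), strides=(4, 4))(x)
    # x has shape (batch, 2, 2, K3): turn the K3 feature maps into a sequence
    # of K3 steps, each carrying the 2x2 = 4 spatial responses (LSTM input K3 x 4).
    x = layers.Permute((3, 1, 2))(x)               # (batch, K3, 2, 2)
    x = layers.Reshape((k3, 4))(x)                 # (batch, K3, 4)
    x = layers.LSTM(n_hidden, return_sequences=True)(x)   # (batch, K3, n_hidden)
    x = layers.Flatten()(x)                        # K3 * n_hidden features
    x = layers.Dense(n_fc, activation='relu')(x)   # FC layer with N neurons
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return models.Model(inputs, outputs)


model = build_cnn_lstm()
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```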
Table 6. Validation performance of the CNN-LSTM for variations in the numbers of kernels in the CONV layers with optimized values of (U_1, U_2) and n_hidden = 128, N = 256.

| (K_1, K_2, K_3) | FUSAR-Ship | OpenSARShip | MSTAR |
| --- | --- | --- | --- |
| (32, 32, 64) | 63.78 | 74.53 | 97.72 |
| (32, 64, 64) | 64.04 | 74.06 | 98.35 |
| (64, 64, 128) | **65.29** | **75.78** | 99.04 |
| (64, 128, 128) | 63.48 | 74.84 | 98.64 |
| (128, 128, 256) | 65.20 | 74.84 | 98.90 |
| (128, 256, 256) | 64.82 | 74.53 | **99.15** |
| (256, 256, 512) | 63.74 | 74.84 | 99.01 |
| (256, 512, 512) | 64.17 | 75.00 | 99.04 |

Values are validation accuracies (%); bold indicates the highest validation accuracy achieved for each dataset.
Table 7. Validation performance of the CNN-LSTM for variations in the size of the hidden state (n_hidden) with the optimal values of (U_1, U_2) and (K_1, K_2, K_3) and N = 128.

| n_hidden | FUSAR-Ship | OpenSARShip | MSTAR |
| --- | --- | --- | --- |
| 32 | 65.16 | 75.31 | **99.23** |
| 64 | 64.52 | 75.47 | 99.08 |
| 96 | 63.78 | 73.91 | 99.19 |
| 128 | **65.29** | **75.78** | 99.15 |
| 160 | 64.34 | 74.69 | 99.15 |
| 192 | 64.69 | 74.84 | 98.71 |

Values are validation accuracies (%); bold indicates the highest validation accuracy achieved for each dataset.
Table 8. Classification performance of the CNN-LSTM on the validation set as a function of the number of neurons in the FC layer (N), with the optimized (U_1, U_2), (K_1, K_2, K_3), and n_hidden hyperparameters.

| N | FUSAR-Ship | OpenSARShip | MSTAR |
| --- | --- | --- | --- |
| 64 | 64.82 | 75.32 | 99.16 |
| 128 | **65.29** | **75.78** | 99.15 |
| 184 | 64.85 | 75.11 | 99.34 |
| 256 | 64.94 | 74.98 | 99.43 |
| 320 | 64.88 | 75.21 | **99.52** |
| 384 | 64.91 | 74.84 | 99.49 |

Values are validation accuracies (%); bold indicates the highest validation accuracy achieved for each dataset.
Table 9. Optimal parameters of the proposed combined CNN-LSTM network for each dataset and validation performances of the optimal architectures.

| Dataset | (U_1, U_2) | (K_1, K_2, K_3) | n_hidden | N | Validation Accuracy (%) |
| --- | --- | --- | --- | --- | --- |
| FUSAR-Ship | (11, 11) | (64, 64, 128) | 128 | 128 | 65.29 |
| OpenSARShip | (18, 18) | (64, 64, 128) | 128 | 128 | 75.78 |
| MSTAR | (24, 24) | (128, 256, 256) | 32 | 320 | 99.52 |
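The hyperparameter sweeps summarized in Tables 2–4 and 6–9 amount to a grid search that retains, for each dataset, the configuration with the highest validation accuracy. A rough sketch is given below; `build_cnn_lstm` stands for a builder such as the one sketched after Table 5, `x_train`, `y_train`, `x_val`, and `y_val` are placeholders for the prepared SAR chips and one-hot labels, and the grid values, epoch count, and batch size are illustrative rather than the exact settings used in the paper.

```python
# Hedged sketch of the validation-driven grid search behind Tables 2-4 and 6-9.
# Assumes build_cnn_lstm (see the sketch after Table 5) and preloaded data arrays.
import itertools

kernel_grid = [(32, 32, 64), (64, 64, 128), (128, 256, 256)]
hidden_grid = [32, 64, 128]
fc_grid = [128, 256, 320]

best = {'val_acc': 0.0, 'config': None}
for kernels, n_hidden, n_fc in itertools.product(kernel_grid, hidden_grid, fc_grid):
    model = build_cnn_lstm(kernels=kernels, n_hidden=n_hidden, n_fc=n_fc)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                        epochs=50, batch_size=32, verbose=0)
    val_acc = max(history.history['val_accuracy'])   # best validation accuracy
    if val_acc > best['val_acc']:
        best = {'val_acc': val_acc, 'config': (kernels, n_hidden, n_fc)}

print('best configuration:', best)
```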
Table 10. The distribution of MSTAR data in the entire training/test database.

| Targets | 2S1 | BMP2 | BRDM2 | BTR60 | BTR70 | D7 | T62 | T72 | ZIL131 | ZSU234 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Entire training | 299 | 233 | 298 | 256 | 233 | 299 | 299 | 232 | 299 | 299 |
| Test | 274 | 195 | 274 | 195 | 196 | 274 | 273 | 196 | 274 | 274 |
Table 11. The number of instances per class in the training, validation, entire training, and test subsets obtained by splitting the selected OpenSARShip data.

| Class | Training (80% of Entire Training) | Validation (20% of Entire Training) | Entire Training (80% of Dataset) | Test (20% of Dataset) |
| --- | --- | --- | --- | --- |
| Cargo | 99 | 25 | 124 | 31 |
| Bulk Carrier | 335 | 84 | 419 | 105 |
| Container Ship | 104 | 26 | 130 | 33 |
Table 12. The number of instances per class in the training, validation, entire training, and test subsets obtained by splitting the selected FUSAR-Ship data.

| Class | Training (80% of Entire Training) | Validation (20% of Entire Training) | Entire Training (80% of Dataset) | Test (20% of Dataset) |
| --- | --- | --- | --- | --- |
| Cargo | 1083 | 271 | 1354 | 339 |
| Bulk Carrier | 174 | 44 | 218 | 55 |
| Fishing | 502 | 126 | 628 | 157 |
| Tanker | 94 | 24 | 118 | 30 |
Table 13. Performance comparison of the proposed CNN and CNN-LSTM networks with existing systems on the FUSAR-Ship dataset.

| Architecture | Number of Parameters | Training Time (s) | Number of Epochs | Test Loss | Test Accuracy (%) |
| --- | --- | --- | --- | --- | --- |
| VGG16 | 134.28M | 711.18 | 99 | 4.3436 | 65.23 |
| ResNet50 | 23.52M | 4602.78 | 763 | 3.8623 | 67.99 |
| Xception | 20.82M | 2822.80 | 505 | 2.8643 | 67.13 |
| DenseNet121 | 6.96M | 5250.83 | 388 | 3.4620 | 71.08 |
| EfficientNetB0 | 4.01M | 1178.61 | 131 | 2.3526 | 61.45 |
| MobileNetV2 | 2.23M | 1090.02 | 250 | 2.9414 | 57.31 |
| Proposed CNN | 363k | 3447.87 | 4109 | 4.8501 | 67.47 |
| Proposed CNN-LSTM | 3.66M | 377.54 | 163 | 4.1756 | 65.58 |
Table 14. Performance comparison of the proposed CNN and CNN-LSTM networks with existing systems on the OpenSARShip dataset.

| Architecture | Number of Parameters | Training Time (s) | Number of Epochs | Test Loss | Test Accuracy (%) |
| --- | --- | --- | --- | --- | --- |
| VGG16 | 134.27M | 718.57 | 270 | 2.0185 | 72.19 |
| ResNet50 | 23.51M | 3502.18 | 1854 | 5.8233 | 57.99 |
| Xception | 20.81M | 1958.26 | 978 | 4.1148 | 65.09 |
| DenseNet121 | 6.96M | 5578.21 | 1447 | 8.4319 | 56.80 |
| EfficientNetB0 | 4.01M | 254.74 | 111 | 2.9316 | 52.07 |
| MobileNetV2 | 2.23M | 621.78 | 398 | 2.5883 | 56.21 |
| Proposed CNN | 10.02M | 1375.10 | 1642 | 6.9658 | 69.82 |
| Proposed CNN-LSTM | 6.17M | 132.61 | 166 | 2.5124 | 70.41 |
Table 15. Performance comparison of the proposed CNN and CNN-LSTM networks with existing systems on the MSTAR dataset.

| Architecture | Number of Parameters | Training Time (s) | Number of Epochs | Test Loss | Test Accuracy (%) |
| --- | --- | --- | --- | --- | --- |
| VGG16 | 138.30M | 1073.35 | 132 | 0.1532 | 98.14 |
| ResNet50 | 23.53M | 2902.83 | 425 | 0.2698 | 95.67 |
| Xception | 20.83M | 3397.96 | 504 | 0.1836 | 95.34 |
| DenseNet121 | 6.96M | 11188.50 | 731 | 0.1543 | 97.77 |
| EfficientNetB0 | 4.02M | 1541.87 | 155 | 0.5745 | 86.85 |
| MobileNetV2 | 2.24M | 1671.27 | 275 | 4.2541 | 40.74 |
| Proposed CNN | 31.10M | 1880.30 | 289 | 0.1239 | 98.52 |
| Proposed CNN-LSTM | 59.33M | 686.19 | 74 | 0.0910 | 98.35 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
