Ship Classification Based on Attention Mechanism and Multi-Scale Convolutional Neural Network for Visible and Infrared Images

Visible image quality is highly susceptible to changes in illumination, and ship classification using images acquired by a single sensor faces inherent limitations. This study proposes a ship classification method based on an attention mechanism and a multi-scale convolutional neural network (MSCNN) for visible and infrared images. First, the features of the visible and infrared images are extracted by a two-stream symmetric multi-scale convolutional neural network module and then concatenated, so as to make full use of the complementary features present in multi-modal images. Next, the attention mechanism is applied to the concatenated fusion features to emphasize local detail areas in the feature map, further improving the feature representation capability of the model. Lastly, the attention-weighted features and the original concatenated fusion features are added element by element and fed into fully connected layers and a Softmax output layer for the final classification output. The effectiveness of the proposed method is verified on the visible and infrared spectra (VAIS) dataset, where it achieves 93.81% classification accuracy. Compared with other state-of-the-art methods, the proposed method extracts features more effectively and delivers better overall classification performance.


Introduction
Ship classification plays an important role in military and civilian fields, such as maritime traffic, fishing vessel monitoring, and maritime search and rescue [1,2]. However, in practice, ship classification results are highly susceptible to background clutter, and the differences among various types of ships are difficult to recognize. Therefore, ship classification has become one of the research hotspots in pattern recognition.
The main types of ship image are synthetic aperture radar (SAR) images, visible images and infrared images. After the launch of SEASAT in the 1970s, SAR began to be used in marine environmental research. SAR images are immune to light and weather conditions, but they have low resolution and, being radar signals, are susceptible to electromagnetic interference. Visible images, on the other hand, have high resolution and detailed texture, but they are easily affected by light conditions; when illumination is insufficient, the acquired image detail drops significantly. Infrared images are likewise unaffected by light conditions; although their resolution is not very high, they preserve a clear target contour, and infrared sensors have the practical advantage of producing stable imagery.
The major contributions of this study can be summarized as follows: (1) A two-stream symmetric MSCNN feature extraction module is proposed to extract the features of visible and infrared images; the module can selectively extract deep features of visible and infrared images that carry more detailed information. (2) The visible image features and infrared image features are concatenated to exploit the complementary information within different modal images, so that a more detailed description of the ship object can be obtained. (3) The attention mechanism is applied to the concatenated fusion layer to enhance important local details in the feature map, thereby improving the overall classification capability of the model.
The remainder of this paper is organized as follows. Section 2 describes the proposed classification method in detail. Section 3 introduces the visible and infrared spectra (VAIS) dataset [24] and the parameter settings, and analyzes the experimental results. Section 4 summarizes the conclusions and prospects for future work.

Framework of Proposed Approach
Visible images are quite vulnerable to light conditions, which is why ship classification relying only on single-sensor images is subject to many limitations. A deep learning algorithm can automatically acquire higher-level, more abstract image features, and an attention mechanism can enhance the feature representation with the more effective information in the feature map. However, extracting ship image features with a single-scale convolution kernel is prone to omitting details. Therefore, this study proposes a ship classification method based on an attention mechanism and MSCNN for visible and infrared images, so as to combine the respective advantages of the different sensor image types and improve the accuracy of ship classification. The overall flow chart is shown in Figure 1. The proposed method consists of a feature extraction module, an attention mechanism and feature fusion module, and a classification module. The feature extraction module uses a two-stream symmetric MSCNN to extract the features of the preprocessed visible and infrared images, respectively. The attention mechanism and feature fusion module first concatenates the extracted visible image features and infrared image features, and then obtains attention weights by applying the attention mechanism to the concatenated fusion feature layer, enhancing key local features, suppressing unimportant features, and improving the feature expression of the model. The classification module is composed of three fully connected layers and a Softmax output layer; the ship classification results are obtained through the Softmax output. Image preprocessing in the figure refers to image size adjustment (refer to Section 2.2).

Two-Stream Symmetric Multi-Scale Convolutional Neural Network (MSCNN) Feature Extraction Module
A CNN is a feed-forward neural network architecture that adopts local perception and weight sharing. A traditional CNN contains convolutional layers, pooling layers, fully connected layers, etc. Among these, the network architecture is one of the core factors determining classification performance, and choosing an appropriate CNN framework to extract ship image features effectively is a prerequisite for improving it. However, feature extraction using a single-scale convolution kernel is prone to omitting details. Inspired by InceptionNet [25], this study uses convolution kernels of different sizes in the convolutional layer to extract features at different scales, thereby enriching the ship image features.
The proposed MSCNN feature extraction module consists of two identical parallel multi-scale CNNs. The MSCNN is mainly composed of 4 convolutional layers (Conv1-Conv4), 3 pooling layers (Max Pooling1-Max Pooling3), 3 fully connected layers (FC1-FC3) and a Softmax output layer. After Conv3, two sets of convolution kernels of different sizes (3 × 3 and 5 × 5) are applied in parallel to obtain Conv4_1 and Conv4_2, which are then concatenated; extracting the deep features of the ship image with two kernel sizes yields a feature map containing more detailed information and reduces information loss during processing. The rectified linear unit (ReLU) is used in both the convolutional and fully connected layers, as it prevents vanishing gradients, makes the network sparse, and is more efficient than the sigmoid function. To prevent over-fitting, dropout is applied in the 3 fully connected layers: in each iteration, dropout randomly hides certain neurons, and the hidden neurons do not contribute to the parameter updates, which effectively avoids overfitting and enhances the generalization ability of the network. The layer parameters are shown in Table 1. The input image size is 227 × 227 × 3. Since the original infrared image is grayscale, we create a pseudo-red, green and blue (RGB) image using the same method as in [24], where the single infrared channel is duplicated three times. The pooling layers adopt maximum pooling of size 3 × 3, which reduces the dimensionality of the preceding convolution result, simplifies the calculation, and retains more image texture. "Padding: 1" refers to expanding the edges with one circle of zeros. Conv4 refers to the concatenation of Conv4_1 and Conv4_2, such that the output size is 13 × 13 × 512.
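As a concrete illustration, the parallel Conv4_1/Conv4_2 branches can be sketched in PyTorch as follows. The 256 channels per branch and the 384 input channels from Conv3 are assumptions for this sketch (Table 1 is not reproduced here); only the 13 × 13 × 512 concatenated output follows the text. The class name `MultiScaleConv4` is ours.

```python
import torch
import torch.nn as nn

class MultiScaleConv4(nn.Module):
    """Parallel 3x3 and 5x5 convolutions whose outputs are concatenated,
    mirroring the Conv4_1/Conv4_2 branches described in the text."""
    def __init__(self, in_channels=384, branch_channels=256):
        super().__init__()
        # Padding keeps the two branches at the same spatial size so they
        # can be concatenated along the channel dimension.
        self.branch3x3 = nn.Conv2d(in_channels, branch_channels, kernel_size=3, padding=1)
        self.branch5x5 = nn.Conv2d(in_channels, branch_channels, kernel_size=5, padding=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        a = self.relu(self.branch3x3(x))
        b = self.relu(self.branch5x5(x))
        return torch.cat([a, b], dim=1)  # channels: 2 * branch_channels

x = torch.randn(1, 384, 13, 13)
y = MultiScaleConv4()(x)
print(y.shape)  # torch.Size([1, 512, 13, 13])
```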
In this study we compare the classification results of visible ship images and infrared ship images obtained with the MSCNN against a network (referred to as CNN) that uses only 256 convolution kernels of size 3 × 3 for the convolution operation in Conv4. Both networks serve as baseline methods (refer to Section 3.4) against which the proposed method is validated.
Feature visualization of different convolutional layers in the MSCNN is shown in Figure 2, taking the visible image as an example. The number of feature maps in each layer equals the number of filters in Table 1, and different convolution kernels respond differently to various positions in the ship image. Shallow layers mainly extract texture features, with feature maps close to the original image. Deeper layers focus more on features such as contours and shapes, which are more abstract and representative; the deeper the layer, the lower the resolution of the feature map.
Since the feature maps obtained by different convolution kernels of the same layer are complementary when describing ship images, an overall feature map can be obtained by fusing the individual maps of a layer in a 1:1 ratio. Figures 3 and 4 compare the overall feature maps of a visible image and the corresponding infrared image at CNN's Conv4 and MSCNN's Conv4. The feature map size is consistent with the output size of Conv4 and the route layer. It can be seen that MSCNN-based feature extraction responds better to the ship area, i.e., the highlighted yellow area in the feature maps in Figures 3b and 4b is darker and wider than in Figures 3a and 4a.
This would play a positive role in our subsequent fusion classification of visible image features and infrared image features. In addition, it can be seen that the position of strongest response to ship area is different between the visible image feature and infrared image feature. Therefore, the fusion of these two features can effectively use the complementary information within different modal images, enrich the fused information, and improve the ship classification performance.
It is known that features extracted by deep convolutional layers contain abundant useful information. Furthermore, the medium-term feature fusion method has achieved better classification results according to [26]. In view of the above, this study fuses the features obtained by the Max Pooling3 layer of the MSCNN, which reduces the dimensionality of the last convolutional layer (i.e., Conv4). The two-stream symmetric MSCNN feature extraction module is shown in Figure 5. The visible and infrared images of the same ship object are the respective inputs of the two-stream network for feature extraction, and the two streams are joined in the concatenated feature fusion layer after Max Pooling3 to obtain the concatenated fusion feature. Three fully connected layers and one Softmax output layer of the multi-scale CNN then classify the fusion features, with the number of Softmax output nodes equal to 6 (the number of ship types in the VAIS dataset).
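The fusion-and-classification head described above can be sketched as follows, assuming each stream's Max Pooling3 output is 6 × 6 × 512 (per Table 1) and fully connected layer sizes of 4096, 4096 and 2048; this is an illustrative sketch under those assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

# Each stream's Max Pooling3 output (6x6x512) is concatenated to 6x6x1024,
# then classified by three fully connected layers and a 6-way output.
# Dropout on the FC layers follows the text; the 0.5 rate is from Section 3.
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(6 * 6 * 1024, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 2048), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(2048, 6),  # Softmax is applied by the loss during training
)

vis = torch.randn(1, 512, 6, 6)   # visible-stream features
ir = torch.randn(1, 512, 6, 6)    # infrared-stream features
fused = torch.cat([vis, ir], dim=1)
logits = classifier(fused)
print(logits.shape)  # torch.Size([1, 6])
```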
The proposed ship classification model also includes a training process and a testing process. During training, the visible image and infrared image of the same object (with the same label) are preprocessed and input into the two-stream network to extract features and conduct simultaneous training. The error between the true class labels and the predicted class labels obtained by the Softmax function is then calculated, after which the weights and biases are adjusted by back propagation to minimize the error. Lastly, the optimal model is saved.
In the testing phase, the visible image and infrared image of the same object (with the same label) are likewise preprocessed and input into the two-stream network to extract features; the saved optimal model is then called to classify these features, and the predicted labels of the ship images are output. Preprocessing randomly crops the training images to 227 × 227 pixels; such random cropping not only increases the training data but also improves the generalization ability of the model. The stochastic gradient descent (SGD) algorithm [27] is used to minimize the cross-entropy loss function. Test images are cropped to 227 × 227 pixels by center cropping. We also use data augmentation techniques such as random horizontal flipping and z-score standardization [28] to ensure sample randomness and avoid model overfitting.


Feature Fusion
In this study, the visible image features and infrared image features extracted by the MSCNN are fused; feature fusion combines the effective information of the two features to obtain a more comprehensive representation of the ship object. Common feature fusion methods [29] include additive fusion, maximum fusion and concatenated fusion. Feature fusion can be defined as:

F = Φ(X, Y)

where X and Y represent the visible image features and infrared image features extracted by the two-stream symmetric MSCNN, respectively, F denotes the fusion feature, and X, Y, F ∈ R^(H×W×C), with H, W and C the height, width and number of channels of the feature map. Additive fusion adds the element values at corresponding positions of the two feature maps, leaving the total number of channels unchanged. If the visible image feature is denoted as X = [x_1, x_2, ..., x_n] and the infrared image feature as Y = [y_1, y_2, ..., y_n], then additive fusion can be written as:

F_add = [x_1 + y_1, x_2 + y_2, ..., x_n + y_n]

Maximum fusion takes the element of higher value at each corresponding position of the two feature maps as the fusion result:

F_max = [max(x_1, y_1), max(x_2, y_2), ..., max(x_n, y_n)]

It should be noted that both additive fusion and maximum fusion are only applicable to feature maps of the same dimension.
Concatenated fusion, on the other hand, connects the two feature maps directly along the channel dimension and can be applied to feature maps of any channel dimension. The total number of fusion feature channels is the sum of the visible image feature channels and the infrared feature channels:

F_cat = [X, Y]

Experimental comparison shows that concatenated fusion of the visible image features and infrared image features achieves a better classification effect (refer to Section 3.4.1); since concatenated fusion also retains all elements of the feature maps, this study adopts the concatenated fusion method. From Table 1, the output size of each feature map after Max Pooling3 is 6 × 6 × 512, where 512 is the number of channels; hence the size of the feature map after concatenated fusion is 6 × 6 × 1024. The detailed concatenated fusion process is shown in Figure 6.
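The three fusion operations can be illustrated in PyTorch on feature maps of the sizes given above:

```python
import torch

H, W, C = 6, 6, 512
X = torch.randn(1, C, H, W)  # visible-image features after Max Pooling3
Y = torch.randn(1, C, H, W)  # infrared-image features after Max Pooling3

F_add = X + Y                     # additive fusion: channel count unchanged
F_max = torch.maximum(X, Y)       # maximum fusion: element-wise maximum
F_cat = torch.cat([X, Y], dim=1)  # concatenated fusion: channel counts add

print(F_add.shape)  # torch.Size([1, 512, 6, 6])
print(F_cat.shape)  # torch.Size([1, 1024, 6, 6])
```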

Feature Fusion Layer Based on Attention Mechanism
The attention mechanism (AM) [30] is a cognitive mechanism that mimics the human brain: in visual perception, it attends mainly to the features of interest and suppresses redundant information. The attention mechanism can be integrated into a CNN framework with negligible overhead and trained together with the CNN [31]. Inspired by the convolutional block attention module (CBAM) [32], which allows auto-learning of the pixel correlations among different feature maps, we add the attention mechanism after the feature fusion layer. By combining the learned attention weights with the original concatenated fusion features, the model can greatly enhance the local details of ship images, thereby improving the representation ability of the fusion feature maps. Subsequent experiments also confirm that adding this module improves ship classification performance.
The structure of the attention mechanism and feature fusion module is shown in Figure 7. The attention mechanism includes a channel attention module and a spatial attention module connected in series. The concatenated fusion feature is used as the input of channel attention; the channel attention weights are calculated and multiplied channel-wise with the concatenated fusion feature to obtain the feature map of the channel attention module, which is then used as the input of spatial attention. After the spatial attention weights are obtained, they are likewise multiplied with the feature map of the channel attention module to arrive at the final attention feature map. To avoid feature loss and performance degradation, the attention feature map and the concatenated fusion feature are added element by element to obtain the final refined feature.

Figure 7. Structure diagram of attention mechanism and feature fusion module. ⊗ denotes element-wise multiplication, ⊕ denotes element-wise addition.

Channel Attention Module
The channel attention module establishes a weight map to evaluate the importance of each channel. The channel that contains more important information has higher weight, and vice versa. It focuses on "what" it views as meaningful. The channel attention module is shown in Figure 8.
Firstly, the concatenated fusion feature F, of dimension H × W × C, serves as the input feature. Average pooling and maximum pooling are applied to the input feature to aggregate the spatial information of the feature map, yielding the channel attention descriptors F^c_avg and F^c_max of dimension 1 × 1 × C. These two descriptors are fed into a shared multi-layer perceptron (MLP) with one hidden layer. To reduce the number of parameters, the hidden layer activation size is C/r × 1 × 1, where r is the compression ratio; in this study, r is set to 8 (refer to Section 3.4.2). The outputs of the two parallel branches are added and passed through the sigmoid activation function to obtain the channel attention weight map M_c. Finally, M_c is multiplied with the original concatenated fusion feature F to obtain the channel-refined feature F′. The calculation of M_c and F′ can be expressed as:

M_c = σ(MLP(F^c_avg) + MLP(F^c_max))

F′ = M_c ⊗ F

where σ denotes the Sigmoid activation function and ⊗ denotes element-wise multiplication.
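A CBAM-style channel attention module matching this description can be sketched in PyTorch as follows; the channel count of 1024 and r = 8 follow the text, while the class name and remaining details are ours.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: a shared MLP over average- and max-pooled
    descriptors, summed and passed through sigmoid, then used to
    reweight the input channels."""
    def __init__(self, channels=1024, r=8):
        super().__init__()
        # Shared MLP with one hidden layer of size channels // r.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )

    def forward(self, F):
        b, c, h, w = F.shape
        avg = F.mean(dim=(2, 3))   # average-pooled descriptor, shape (b, c)
        mx = F.amax(dim=(2, 3))    # max-pooled descriptor, shape (b, c)
        Mc = torch.sigmoid(self.mlp(avg) + self.mlp(mx))  # channel weights
        return F * Mc.view(b, c, 1, 1)  # channel-refined feature F'

F = torch.randn(2, 1024, 6, 6)
out = ChannelAttention()(F)
print(out.shape)  # torch.Size([2, 1024, 6, 6])
```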

Spatial Attention Module
The spatial attention module obtains the weight map of features in the spatial dimension, focusing on "where" useful information can be found, and thus complements channel attention. The details of the spatial attention module are shown in Figure 9. The channel-refined feature F′, of dimension H × W × C, is the input. Maximum pooling and average pooling are performed in parallel along the channel dimension to obtain two descriptors F^s_avg and F^s_max of dimension H × W × 1, which are then concatenated. The concatenated descriptor is convolved with kernels of size 3 × 3 and processed by the sigmoid activation function to obtain the spatial attention weight map M_s. Finally, M_s and F′ are multiplied to obtain the spatially refined feature F″. The calculation of M_s and F″ can be expressed as:

M_s = σ(f^(3×3)([F^s_avg; F^s_max]))

F″ = M_s ⊗ F′

where f^(3×3) denotes convolution kernels of size 3 × 3.
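Similarly, the spatial attention module with the 3 × 3 convolution described above can be sketched as follows (the class name is ours):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise average and max maps are
    concatenated, convolved (3x3, per the text), and passed through
    sigmoid to produce a per-position weight map."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, F):
        avg = F.mean(dim=1, keepdim=True)   # (b, 1, h, w) average map
        mx = F.amax(dim=1, keepdim=True)    # (b, 1, h, w) max map
        Ms = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return F * Ms                       # spatially refined feature F''

F = torch.randn(2, 1024, 6, 6)
out = SpatialAttention()(F)
print(out.shape)  # torch.Size([2, 1024, 6, 6])
```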

Spatial Attention Module
Spatial attention module obtains the weight map of features in spatial dimension, which focuses on "where" useful information can be found, supplementing channel attention. The details of the spatial attention module is shown in Figure 9. Firstly, concatenated fusion feature F serves as input feature with dimension H × W × C. Both average pooling and maximum pooling are then applied to input features to aggregate spatial information of the feature map, and the channel attention descriptors F c avg and F c max of dimension 1 × 1 × C are obtained. These two channel attention descriptors are fed into a multi-layer perceptron (MLP) with a hidden layer. To reduce number of parameters used, the activation size of the hidden layer is R C/r×1×1 , in which r is the compression ratio. In this study, r is set to 8 (refer Section 3.4.2). The features of the two parallel branches are then added and processed by sigmoid activation function to obtain the final channel attention weight map M c . Finally, multiplication between the channel attention weight map M c and the original concatenated fusion feature F is done to obtain the channel-refined feature F . The calculation process of channel attention weight map M c and channel-refined feature F can be expressed as: where σ denotes the Sigmoid activation function, ⊗ denotes element-wise multiplication.

Spatial Attention Module
The spatial attention module obtains a weight map of the features in the spatial dimension, focusing on "where" useful information can be found and thereby complementing channel attention. The details of the spatial attention module are shown in Figure 9. Firstly, the input feature is F of dimension H × W × C. Maximum pooling and average pooling are performed in parallel on the input feature along the channel dimension to obtain two descriptors F_avg^s and F_max^s of dimension H × W × 1, which are then concatenated. After that, the concatenated descriptor is convolved with a convolution kernel of size 3 × 3 and processed by the sigmoid activation function to obtain the spatial attention weight map M_s. Finally, the spatial attention weight map M_s and F are multiplied to obtain the feature F″ after spatial attention. The calculation process can be expressed as:

M_s = σ(f^{3×3}([F_avg^s; F_max^s]))
F″ = M_s ⊗ F

where f^{3×3} denotes a convolution kernel of size 3 × 3.
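A minimal PyTorch sketch of the spatial attention module (pooling along the channel dimension, a 3 × 3 convolution, then a sigmoid, per Figure 9). The class name is our own; this is an illustration of the described computation, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise avg/max descriptors -> 3x3 conv -> sigmoid."""
    def __init__(self):
        super().__init__()
        # One 3x3 convolution over the two concatenated H x W x 1 descriptors;
        # padding=1 keeps the spatial size unchanged.
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: input feature F of shape (N, C, H, W).
        avg = x.mean(dim=1, keepdim=True)        # F_avg^s, shape (N, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)       # F_max^s, shape (N, 1, H, W)
        m_s = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s
        return m_s * x                           # feature F'' after spatial attention
```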
From the above analysis, we can see that both average pooling and maximum pooling are used in the attention mechanism. Average pooling takes the average information on each channel, whereas maximum pooling only considers the most significant information of each channel of the feature map. Through the combination of the two pooling operations, the attention mechanism is able to focus on the important channel and spatial feature information of the ship image, and filter out the unimportant feature information. As a result, more discriminative features can be obtained to improve ship classification performance.

Experimental Environment and Parameter Setting
The experimental environment used in this study is a computer with an Intel(R) Core(TM) i9-7980XE @ 2.6 GHz processor, an NVIDIA TITAN Xp (Pascal) GPU and 32 GB of memory. All experiments are implemented in Python 3.5 with the open-source deep learning framework PyTorch.
Experimental parameter settings are as follows: the batch size is 32 and the learning rate is 0.001. The optimization method is the stochastic gradient descent (SGD) algorithm, with momentum 0.9 and weight decay coefficient 0.0001. The dropout rate is 0.5, and the number of training epochs is 500.
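The training configuration above can be sketched in PyTorch as follows. The `nn.Sequential` model is a hypothetical stand-in for the actual two-stream MSCNN; only the optimizer, loss, dropout and batch settings come from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the network's classifier head (6 ship classes).
model = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(512, 6))

# Reported settings: SGD with lr 0.001, momentum 0.9, weight decay 0.0001.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0001)
criterion = nn.CrossEntropyLoss()  # pairs with the Softmax output layer

# One illustrative training step with batch size 32 (the paper trains 500 epochs).
features = torch.randn(32, 512)
labels = torch.randint(0, 6, (32,))
optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()
```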

Experimental Dataset
The dataset used in this study is the VAIS dataset [24], the only publicly available dataset of paired visible and long-wave infrared ship images. These images were captured using a multimodal stereo camera rig at harbours. The RGB global-shutter camera was an ISVI IC-C25; the long-wave infrared camera was a Sofradir-EC Atom 1024, which has a spectral range of 8-12 microns. The cameras were mounted tightly next to each other and checked to ensure no interference. The dataset consists of 2865 images (1623 visible images and 1242 infrared images), including 1088 paired visible and infrared images, and contains a total of 154 nighttime infrared images. The number of unique ships is 264. For most ships, only one orientation was captured; for a few, up to 5 to 7 orientations were captured. This way, duplicates in the dataset are avoided. The image format is png.
The dataset can be divided into 6 coarse-grained categories, namely Medium "other" ships, Merchant ships, Medium passenger ships, Sailing ships, Small boats and Tugboats, as shown in Figure 10. It can be seen that the ship backgrounds are complex, with uneven illumination and varying object sizes. The area of the visible bounding boxes ranges from 644 to 4,478,952 pixels, with a mean of 181,319 pixels and a median of 9983 pixels. The area of the infrared bounding boxes ranges from 594 to 137,240 pixels, with a mean of 8544 pixels and a median of 1610 pixels. In this study, only the 1088 pairs of visible and infrared images are used. Following the "official" training/test split, 539 pairs are randomly selected as training images and the remaining 549 pairs serve as test images. The numbers of samples in the training set and test set are listed in Table 2.

Evaluation Metrics
The evaluation metrics of ship image classification results adopted by this study include classification accuracy, F1-score and average feature extraction time consumption per image.
Classification accuracy is defined as the ratio of correctly classified samples to the total number of samples; the higher the ratio, the better the classification performance. It can be expressed as:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

where TP, FP, TN and FN denote the number of true positives, false positives, true negatives and false negatives, respectively.
F1-score is a comprehensive measure of classification performance, defined as the weighted harmonic mean of the precision ratio and the recall ratio; its maximum is 1 and its minimum is 0. It can be defined as:

F1 = 2 × Precision × Recall / (Precision + Recall)

The precision ratio is the number of true positives divided by the number of predicted positive samples, and the recall ratio is the number of true positives divided by the number of all positive samples:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

To further analyze the classification results, the confusion matrix is used to visualize them. The confusion matrix shows the mistakes made by the classifier on a multi-class problem: the horizontal axis represents the predicted category and the vertical axis the true category, so the diagonal elements are the correctly classified ship images of each type. The diagonal elements of the normalized confusion matrix represent the classification accuracy achieved for each ship type.
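The metrics above can be computed directly from the four confusion counts. A minimal sketch (the example counts are illustrative, not taken from the paper):

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Example with illustrative counts: 90 TP, 10 FP, 85 TN, 15 FN.
acc, prec, rec, f1 = classification_metrics(90, 10, 85, 15)
# acc = 0.875, prec = 0.9, rec = 90/105, f1 = 180/205
```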

Classification Performance Comparison
To validate its classification performance, the proposed method was compared with baseline methods and other state-of-the-art methods under the same experimental conditions. Table 3 lists two evaluation metrics, classification accuracy and average feature extraction time per image, for the baseline methods, the feature fusion methods and the proposed method on the VAIS dataset. Herein, the baseline methods are CNN and MSCNN, where CNN represents Conv4 with only 256 convolution kernels of size 3 × 3. CNN_AFF refers to additive feature fusion of the infrared and visible image features extracted by CNN. CNN_CFF and MSCNN_CFF represent concatenated feature fusion of the infrared and visible image features extracted by CNN and MSCNN, respectively. CNN_CFF_SE and MSCNN_CFF_SE denote applying the channel attention mechanism of literature [22] to the concatenated fusion features. CNN_CFF_AM and MSCNN_CFF_AM indicate applying the proposed attention mechanism to the concatenated fusion features.
It can be observed that the classification accuracy on visible images is higher than on infrared images, mainly due to the lower resolution and sparser texture information of infrared images. MSCNN achieves higher classification accuracy than CNN on both visible and infrared images, indicating that the proposed MSCNN can extract more detailed information and enrich the ship image features. The classification accuracy of the feature fusion methods is higher than that of the baseline methods, which means that fusing visible and infrared image features exploits complementary information from multiple sources and improves classification performance. Furthermore, CNN_CFF attains higher classification accuracy than CNN_AFF, as concatenated feature fusion avoids the information offset caused by element-wise addition in the spatial dimension; therefore, we did not conduct additive fusion experiments for the MSCNN method.
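The difference between additive fusion (AFF) and concatenated fusion (CFF) can be shown with a toy PyTorch snippet; the feature shapes here are illustrative, not taken from the paper.

```python
import torch

# Toy visible and infrared feature maps from the two network streams.
f_vis = torch.randn(1, 256, 7, 7)
f_ir = torch.randn(1, 256, 7, 7)

# Additive feature fusion (AFF): element-wise sum, channel count unchanged.
f_aff = f_vis + f_ir                      # shape (1, 256, 7, 7)

# Concatenated feature fusion (CFF): stack along the channel dimension,
# so information from both sources is preserved rather than summed.
f_cff = torch.cat([f_vis, f_ir], dim=1)   # shape (1, 512, 7, 7)
```

Concatenation doubles the channel count but keeps the two modalities separable for the layers that follow, which is the property the comparison above attributes to CNN_CFF over CNN_AFF.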
It can be seen that the classification accuracy of the CNN_CFF_SE, MSCNN_CFF_SE, CNN_CFF_AM and MSCNN_CFF_AM methods is higher than that of the feature fusion methods, and the MSCNN_CFF_AM method (the proposed method) achieves the highest classification accuracy. This is because fused features modified by the attention mechanism highlight key local features, suppress useless features, and significantly enhance feature expression. The SE (squeeze-and-excitation) module in literature [22] focuses on the channel information of the feature map and ignores the importance of spatial location. The attention mechanism in this study combines channel and spatial attention modules, so that the model learns "what" and "where" to attend to in the channel and spatial dimensions. Hence, the proposed method achieves better classification performance. Table 3 reports the classification accuracy (%) and the average feature extraction time per image (ms) of the proposed method, baseline methods and feature fusion methods on the VAIS dataset.
Figure 11 depicts the average feature extraction time per image of the proposed method, baseline methods and feature fusion methods on the VAIS dataset. It can be seen that the additions of feature fusion and the attention mechanism make the average time per image of the proposed method slightly higher than that of the baseline methods; however, the proposed method has obvious advantages in classification accuracy, and its time consumption of 0.140 ms is still relatively short.
The classification accuracy obtained per ship class using the baseline methods, feature fusion methods and the proposed method on the VAIS dataset is listed in Table 4. As observed, the proposed method demonstrates the best classification results compared with the other methods. The classification accuracy of CNN_CFF_AM is better than that of the CNN method for every type, and the classification accuracy of the proposed method is higher than that of MSCNN for all ship types except Merchant. Overall, the proposed method has the highest classification accuracy for Medium-passenger, Small and Tug. Its accuracy for Medium-other and Merchant is slightly lower than that of MSCNN_CFF_SE, but its accuracy for Medium-passenger is 6.78% higher than that of MSCNN_CFF_SE. The classification accuracy for tugs is 100%, owing to the larger difference in appearance between tugs and the other ship types, and the better image quality of the tugs in the VAIS dataset. In conclusion, the proposed method achieved the highest overall classification accuracy on the VAIS dataset.
To further verify the classification capability of the proposed method, we compare the F1-scores of the baseline methods, feature fusion methods and the proposed method, as shown in Table 5. It can be seen that the proposed method achieves the highest average F1-score over the 6 ship types. CNN_CFF_SE gives the best F1-score for Merchant and CNN_CFF_AM the best for Sailing; however, the proposed method gives the highest F1-score for the other four ship types and performs almost as well as CNN_CFF_AM for Sailing. This can largely be attributed to the addition of the feature fusion and attention mechanisms, through which complementary information from multi-source images is effectively utilized and more characteristic features are extracted, greatly enhancing the overall classification capability of the model.
To further verify the effectiveness of the proposed method, we compare it with other state-of-the-art methods developed in recent years. We reimplement the state-of-the-art methods on the VAIS dataset, with the same dataset partition as used for the proposed method. The results are shown in Tables 6-8: Table 6 gives the classification accuracy of the different methods on the VAIS dataset, Table 7 the per-class classification accuracy, and Table 8 the F1-scores.
Among them, the traditional methods (HOG + SVM, LBP + SVM), AlexNet, Method [33], Method [10] and Method [14] process only visible images or infrared images of a single band, whereas Method [19] uses two parallel CNNs to extract the features of visible and infrared images, respectively, and classifies them after feature fusion in the last fully connected layer. Table 6 shows that, compared with the other methods, the proposed method achieves the best classification accuracy on both single-band and multi-source images. Table 7 shows that the proposed method has the highest classification accuracy for Medium-passenger, Sailing, Small and Tug; although it falls slightly behind Method [14] for the Medium-other category, it still beats all the other methods. Table 8 shows that the average F1-score of the proposed method is the highest; Method [19] has the highest F1-score for Merchant, and the proposed method has the highest F1-score for the other five types. It can be concluded that the proposed method achieves the best overall classification performance, owing to the effective extraction and fusion of visible and infrared image features and the inclusion of the attention mechanism.

Confusion Matrix and Confusion Matrix Normalization of the Classification Results
Although the proposed method greatly improves classification performance compared with the other methods, some cases of misclassification remain. The confusion matrix and its normalized form on the VAIS dataset are depicted in Figure 12. It can be seen that 12 Medium-other ships were misjudged as Small, an inter-class error of 15.9%, and 3 Medium-passenger ships were misjudged as Small, an inter-class error of 5.1%, indicating that both Medium-other and Medium-passenger are easily confused with Small. This is not surprising, as their shapes closely resemble each other, especially when the image resolution is low, as shown in Figure 10. Figure 13 illustrates misclassified examples of Medium-passenger ships and Small ships.

Selection of the Compression Ratio r
We carry out experiments with different values of r in order to find the optimal compression ratio for the attention mechanism. Figure 14 shows the classification accuracy and the average feature extraction time per image of the CNN_CFF_AM method under different r. The purpose of r is to reduce the number of parameters. Since the attention mechanism in the feature fusion layer is designed to be lightweight, the difference in the number of parameters caused by different compression ratios is ignored. As seen in Figure 14, when r is 8, a good balance is achieved between classification accuracy and average feature extraction time per image. As such, r is set to 8 in this study.
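The effect of r on the size of the channel-attention MLP can be quantified with a short sketch. The 512-channel width is an assumed, illustrative value, and bias terms are omitted for simplicity.

```python
def mlp_param_count(channels, r):
    """Weight count of a C -> C/r -> C channel-attention MLP (no bias terms)."""
    hidden = channels // r
    return channels * hidden + hidden * channels

# Illustrative 512-channel fused feature map: larger r means a smaller MLP.
counts = {r: mlp_param_count(512, r) for r in (2, 4, 8, 16)}
# counts[2] = 262144, counts[8] = 65536, counts[16] = 32768
```

Doubling r halves the MLP size, which is why the parameter differences across compression ratios stay small relative to the rest of the network.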




Conclusions
In this study, the authors proposed an attention mechanism and MSCNN based method for accurate ship classification. Firstly, a two-stream symmetric MSCNN is adopted to extract the features of visible images and infrared images, and the two feature sets are concatenated so that complementary features can be effectively utilized. After that, the attention mechanism is applied to the concatenated fusion layer to obtain a more effective feature representation. Lastly, the fused features modified by the attention mechanism are sent to the fully connected layers and the Softmax output layer to obtain the final classification result. To verify the effectiveness of the proposed method, we conducted experiments on the VAIS dataset. The results show that, compared with existing methods, the proposed method achieves better classification performance, with a classification accuracy of 93.81%. The F1-scores and the confusion matrix further validate the effectiveness of the proposed method. However, in the presence of high intra-class similarity, the proposed method still produces some misclassifications, and it slightly increases the average feature extraction time per image. In future research, we will explore how to select among the fused features while maintaining high classification accuracy, so as to improve the efficiency of the method.