1. Introduction
Ships, as primary entities in safeguarding maritime interests, have garnered continuous attention [1]. The detection and classification of such ships furnish strategic decision-makers with indispensable insights, effectively truncating decision cycles, and serve as a pivotal foundation for achieving real-time situational awareness in expansive operational environments [2]. With the advancement of remote sensing imaging technologies, the surveillance of vast maritime domains has become feasible, consequently drawing widespread attention to maritime target detection and classification techniques [3]. In civilian domains, this forms the bedrock for implementing marine resource regulation, monitoring illegal fishing, and aiding maritime rescue operations [4], while in military domains, it finds utility in patrolling territorial waters, safeguarding maritime rights, and monitoring critical port facilities [5]. In recent years, with the exponential growth of computational power, deep learning has witnessed accelerated development [6,7]. The application of deep learning-based target detection algorithms to maritime target detection not only economizes on human and material resources but also amplifies efficiency, thereby playing a pivotal role in ensuring maritime security [8].
Infrared (IR) imaging technology offers all-weather operation, excellent concealment, a wide detection range, and resistance to electromagnetic interference; IR small target detection is therefore one of the main means of maritime monitoring. When the distance between IR detectors and targets exceeds 10 km, or even several tens of kilometers, the area covered by the target typically diminishes [9]. This results in relatively diminutive ship dimensions in IR images, escalating the demands on detection algorithms and rendering prevailing algorithms incapable of achieving satisfactory outcomes [10]. Unlike conventional IR small target detection, maritime ship detection confronts multifarious challenges. As depicted in Figure 1, atmospheric scattering and refraction, optical focusing, and various forms of noise produce a low signal-to-noise ratio (SNR) in IR imagery with scant texture detail, resulting in weakened target signals and diminished contrast against backgrounds [11,12]. Additionally, maritime backgrounds are rife with clutter interference. Strong radiation clutter generated by phenomena such as waves and fish-scale ripples resembles ship shapes, thereby easily impeding ship detection [13]. Simultaneously, radiation-intense phenomena such as cloud clusters formed by seawater evaporation, as well as islands, further impede ship detection. Moreover, maritime ships typically appear in fleets, so multiple ships are often present in a single image, amplifying detection complexity. Consequently, maritime ship detection harbors distinctive characteristics and challenges in IR image processing.
To detect IR small targets, several conventional methods have been proposed, including filter-based, local information-based, and data structure-based algorithms. Filter-based methods encompass a wide array of techniques, including spatial domain filters and transform domain filters: the former predict backgrounds in the spatial domain to accentuate targets [14], while the latter exploit target correlation properties in the frequency domain [15]. However, filter-based algorithms can only suppress uniform backgrounds and fail to mitigate complex background noise. Local information-based methods leverage grayscale and brightness variations between the target and its local region [16], but are susceptible to overlooking dim targets and prone to false alarms from high-contrast noise. Data structure-based methods evolve primarily from the differential data structures between targets and backgrounds [17]; they accommodate the low signal-to-clutter ratio (SCR) of IR imagery yet still exhibit high false alarm rates in images containing small targets with shape variations amidst complex backgrounds [18]. These conventional IR small target detection algorithms hinge on strong prior knowledge, which poses challenges in handling complex backgrounds. In recent years, with the advent of deep learning-based algorithms, the field of IR small target detection has achieved leapfrog development, marked by a significant improvement in detection accuracy [19].
Deep learning-based detection algorithms can be classified into those based on target detection strategies and those based on semantic segmentation. Algorithms based on target detection strategies further branch into two-stage, one-stage, and anchor-free algorithms. Two-stage algorithms encompass candidate generation, feature extraction, and classification regression as key steps [20]. Although two-stage algorithms exhibit high precision, they tend to be slower. One-stage algorithms predict target categories and positions directly on the original image [21], while anchor-free algorithms reduce model parameter volume [22]. Algorithms based on semantic segmentation classify images by pixels to obtain position and contour information. Semantic segmentation-based algorithms include fully convolutional networks, encoder-decoder architectures, and attention mechanism algorithms. Fully convolutional networks extract features by replacing the final, fully connected layer with convolutional layers [23]. Encoder-decoder architectures utilize downsampling to extract features and upsampling to restore resolution [24]. Attention mechanism algorithms enhance detection accuracy and efficiency by directing network attention to critical areas in images through training [25,26]. This paper focuses on investigating maritime IR small ship target detection using an encoder-decoder structure semantic segmentation algorithm, with the following primary contributions:
Proposing FCNet, a network tailored for maritime IR small ship detection, exhibiting superior precision performance compared to other prominent algorithms.
Introducing a feature enhancement module (FEM) to enhance input image features before encoding, thereby acquiring superior features.
Devising a context fusion module (CFM) to fuse contextual information during encoding, balancing local and global information while mitigating target edge information loss.
Introducing a semantic fusion module (SFM) in the decoding process to connect shallow features containing position and texture information with deep semantic information through skip connections, facilitating multiscale feature fusion, thereby retaining critical image information and enhancing detection accuracy.
Proposing the Maritime-SIRST dataset, derived from remote sensing satellite IR band images of complex maritime scenes, to meet the requirements of this research and foster development in related fields.
The structure of this paper is as follows: Section 2 provides a brief overview of related works. Section 3 details the network structure of FCNet and introduces the Maritime-SIRST maritime IR ship detection dataset we constructed. Section 4 presents comparative experiments, ablation experiment results, and discussions. Finally, Section 5 offers conclusions.
3. Materials and Methods
3.1. Structure of FCNet
The model proposed in this paper is based on an encoder-decoder architecture. Given the limited pixel occupancy of small targets in images, traditional encoder layers with four levels of downsampling may lead to feature loss. Therefore, we opt for only two encoder and decoder layers.
Figure 4 illustrates the network structure of FCNet, comprising one feature enhancement module (FEM) and two context fusion modules (CFM) in the encoder, and two semantic fusion modules (SFM), three squeeze-and-excitation (SE) blocks, and one FCN Head in the decoder. Initially, the original image is fed into the FEM. The FEM consists of two deformable convolutions and a joint module of standard convolution and dilated convolution (CDCFM), amalgamating multiple convolutions to yield enhanced feature maps. Subsequently, the enhanced feature maps are input into two CFMs. The CFM comprises a max-pooling layer, a CDCFM, and a deformable convolution layer, extracting fused information from contextual and local aspects to mitigate information loss during encoding. During the upsampling process, we employ the SFM. This module first linearly interpolates the feature maps conveyed from the encoder layers for upsampling, then integrates shallow semantic information and amalgamates deep and shallow semantic information through a CDCFM. Additionally, during upsampling, we introduce three SE blocks [38]. The SE block, a form of channel attention mechanism, aids in learning weights for different channels to attain superior feature representation. Ultimately, we utilize the FCN Head to output segmentation results.
3.2. Structure of FEM
In the standard convolution process, the fixed-shape convolution kernel easily loses edge information of the target. For small targets occupying merely a dozen or so pixels, the loss of even a few edge pixels is significant. Processing the input feature map through deformable convolutions allows the model to learn the shape information of the target, thereby reducing the loss of edge details.
The convolution kernel size in standard convolutions is typically set to 3 × 3. Small kernels have a limited receptive field, are capable of extracting only local information, and thus may lose contextual semantic information, while excessively large kernels require more computational resources. To compensate for the loss of contextual semantics in standard convolutions, we introduce dilated convolutions. Given their larger receptive field, dilated convolutions capture multiscale contextual information, whereas standard convolutions obtain more precise local information. Hence, we devise the CDCFM module to balance contextual and local semantic information by amalgamating the feature maps generated by both. The CDCFM module comprises two branches. The first branch includes three dilated convolutions, with dilation rates of 1, 2, and 3, respectively, to capture multiscale contextual information. The second branch consists of two standard convolutions with 3 × 3 kernels, extracting local information. Following each convolutional layer, we incorporate a batch normalization layer to expedite training and enhance model generalization. Each batch normalization layer is followed by a ReLU activation function to provide non-linearity and improve the model's expressive capability.
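To make this structure concrete, the following is a minimal PyTorch sketch of a CDCFM-style module as we read the description above; the class name, the assumption of equal input/output channel widths, and the placement of the final ReLU after the branch addition are illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn

def conv_bn_relu(channels, dilation=1):
    # 3x3 convolution -> batch normalization -> ReLU, as described above.
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

class CDCFM(nn.Module):
    """Sketch of the joint standard/dilated convolution fusion module."""
    def __init__(self, channels):
        super().__init__()
        # Branch 1: three dilated convolutions with dilation rates 1, 2, 3.
        self.dilated = nn.Sequential(*[conv_bn_relu(channels, d) for d in (1, 2, 3)])
        # Branch 2: two standard 3x3 convolutions.
        self.local = nn.Sequential(*[conv_bn_relu(channels) for _ in range(2)])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Element-wise addition of the two branches, followed by ReLU.
        return self.relu(self.dilated(x) + self.local(x))
```

Because `padding` equals `dilation` for the 3 × 3 kernels, both branches preserve spatial resolution, so the element-wise addition is well-defined.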
We have constructed an FEM based on the CDCFM and DCN modules, as illustrated in Figure 5. Initially, the input feature map $X$ undergoes processing via the DCN module to yield $X_1$, which adaptively learns receptive field information tailored to the contours of small targets. Subsequently, $X_1$ is fed into the CDCFM, where the localized details from standard convolutions are merged with the multiscale contextual information from three layers of dilated convolutions, culminating in an enhanced feature representation $X_2$. Following this, $X_3$ is obtained through another pass via the DCN module, effectively compensating for the loss of edge information pertaining to small targets. The precise computational procedures are detailed in Equations (1) and (2):

$$X_2 = \mathrm{ReLU}\big(\mathrm{DConv}^{3}(X_1) \oplus \mathrm{Conv}^{2}(X_1)\big), \quad X_1 = \mathrm{DCN}(X) \qquad (1)$$

$$X_3 = \mathrm{DCN}(X_2) \qquad (2)$$

Here, $\mathrm{DCN}(\cdot)$ denotes deformable convolution, $\mathrm{DConv}(\cdot)$ denotes dilated convolution, with the superscript 3 indicating that dilated convolution is performed three times, $\mathrm{Conv}(\cdot)$ denotes standard convolution, with the superscript 2 indicating that standard convolution is executed twice, and the symbol $\oplus$ signifies direct addition of the corresponding values of the feature maps, followed by normalization through the ReLU function.
The original image, when processed through the FEM, not only acquires a wealth of contextual information through flexible receptive fields of varying sizes and shapes but also amalgamates precise local semantic information. Consequently, the extracted feature maps from the original image become more efficacious, furnishing a more favorable foundation for subsequent encoding procedures.
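Below is a hedged sketch of an FEM matching Equations (1) and (2), reusing the CDCFM class sketched above. Since torchvision's DeformConv2d requires an explicitly predicted offset field, the sketch pairs it with a small offset-predicting convolution; this wrapper and the preserved channel count are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DCN(nn.Module):
    """Deformable convolution with a learned offset field (sketch)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Two offsets (x, y) per kernel sampling location.
        self.offset = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                kernel_size, padding=pad)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):
        return self.deform(x, self.offset(x))

class FEM(nn.Module):
    """DCN -> CDCFM -> DCN, following Equations (1) and (2)."""
    def __init__(self, channels):
        super().__init__()
        self.dcn_in = DCN(channels)
        self.cdcfm = CDCFM(channels)  # CDCFM as sketched in Section 3.2
        self.dcn_out = DCN(channels)

    def forward(self, x):
        return self.dcn_out(self.cdcfm(self.dcn_in(x)))
```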
3.3. Structure of CFM
During the downsampling process, the resolution of the feature map gradually decreases, potentially leading to a significant loss of contextual information during pooling. Therefore, preserving the integrity of contextual semantic information becomes particularly crucial during downsampling.
Our proposed CFM aims to fuse contextual information after pooling, thereby reducing the loss of contextual information. As depicted in Figure 6, the input feature map $F$ undergoes maximum pooling to yield feature map $F_1$. Subsequently, $F_1$ undergoes contextual information fusion via the CDCFM to obtain feature map $F_2$. Finally, after adjusting the shape of the receptive fields through DCN to preserve edge information, the output feature map $F_3$ is obtained. Through two successive CFM modules in the encoder, the final encoded feature map is ultimately obtained. The detailed computational procedures are elucidated in Equations (3) and (4):

$$F_2 = \mathrm{ReLU}\big(\mathrm{DConv}^{3}(F_1) \oplus \mathrm{Conv}^{2}(F_1)\big), \quad F_1 = \mathrm{MaxPool}(F) \qquad (3)$$

$$F_3 = \mathrm{DCN}(F_2) \qquad (4)$$

Here, $\mathrm{MaxPool}(\cdot)$ denotes maximum pooling, $\mathrm{DCN}(\cdot)$ represents deformable convolution, $\mathrm{DConv}(\cdot)$ signifies dilated convolution, with the superscript 3 indicating that dilated convolution is performed three times, $\mathrm{Conv}(\cdot)$ signifies standard convolution, with the superscript 2 indicating that standard convolution is executed twice, and the symbol $\oplus$ denotes direct addition of the corresponding values of the feature maps.
Although the resolution of the feature map decreases during downsampling, the comprehensive fusion of contextual and local information in the low-resolution feature map is achieved through two CFM modules. This minimizes the loss of contextual information as much as possible. This design aims to balance the relationship between downsampling and contextual information, ensuring better performance in small target detection tasks.
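A corresponding sketch of the CFM, again reusing the CDCFM and DCN classes above, is shown below; the 1 × 1 projection for changing channel width between encoder levels is a hypothetical addition, since the paper does not specify channel counts.

```python
import torch.nn as nn

class CFM(nn.Module):
    """MaxPool -> CDCFM -> DCN, following Equations (3) and (4)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2)  # halves spatial resolution
        # Hypothetical 1x1 projection; the paper does not give channel widths.
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.cdcfm = CDCFM(out_channels)
        self.dcn = DCN(out_channels)

    def forward(self, x):
        return self.dcn(self.cdcfm(self.proj(self.pool(x))))
```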
3.4. Structure of SFM
The SFM module, situated within the decoder segment, is tasked with amalgamating semantic information from both deep and shallow layers to offset any potential semantic loss incurred during the encoding process, thereby acquiring more enriched semantic information. As depicted in Figure 7, the feature map $D$ outputted by the encoder first undergoes upsampling via linear interpolation, and the result is then directly connected with the shallow feature map $S$ to obtain $D_1$. Subsequently, $D_1$ is fed into the CDCFM to integrate semantic information, ultimately yielding a feature map $D_2$ that amalgamates shallow features and deep semantics. The detailed computational procedures are delineated in Equations (5) and (6):

$$D_1 = \mathrm{Concat}\big(\mathrm{Lerp}(D),\, S\big) \qquad (5)$$

$$D_2 = \mathrm{ReLU}\big(\mathrm{DConv}^{3}(D_1) \oplus \mathrm{Conv}^{2}(D_1)\big) \qquad (6)$$

Here, $\mathrm{Concat}(\cdot)$ signifies the concatenation operation, $\mathrm{Lerp}(\cdot)$ denotes linear interpolation, $\mathrm{DConv}(\cdot)$ represents dilated convolution, with the superscript 3 indicating that dilated convolution is performed three times, $\mathrm{Conv}(\cdot)$ denotes standard convolution, with the superscript 2 indicating that standard convolution is executed twice, and the symbol $\oplus$ signifies direct addition of the corresponding values of the feature maps.
This design aims to enhance the performance of small target detection, particularly in addressing the issue of semantic loss, by fully integrating deep and shallow layer information through the SFM module during the decoding stage [39].
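A minimal sketch of the SFM under the same assumptions follows; bilinear interpolation stands in for the 2-D "linear interpolation" above, and the 1 × 1 fusion convolution that merges the concatenated channels is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SFM(nn.Module):
    """Upsample -> concatenate with skip feature -> CDCFM, per Equations (5) and (6)."""
    def __init__(self, deep_channels, shallow_channels, out_channels):
        super().__init__()
        # Hypothetical 1x1 convolution to merge the concatenated channels.
        self.reduce = nn.Conv2d(deep_channels + shallow_channels, out_channels, 1)
        self.cdcfm = CDCFM(out_channels)  # CDCFM as sketched in Section 3.2

    def forward(self, deep, shallow):
        # Upsample the deep feature map to the shallow map's resolution.
        up = F.interpolate(deep, size=shallow.shape[-2:],
                           mode="bilinear", align_corners=False)
        return self.cdcfm(self.reduce(torch.cat([up, shallow], dim=1)))
```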
3.5. Other Modules
The SE block stands as a highly effective channel attention mechanism, characterized by its minimal parameter and computational footprint. Through straightforward compression and excitation operations, it discerns weightings across different channels, thereby acquiring enhanced feature representation. As illustrated in Figure 8, the SE block first subjects the input feature map $X$ to spatial feature compression, effectuated through global average pooling across the spatial dimensions, yielding a $1 \times 1 \times C$ feature map $Z$. Subsequently, via a fully connected (FC) layer, it derives a channel attention vector $\tilde{Z}$. The channel attention vector $\tilde{Z}$ and the original input feature map $X$ then undergo channel-wise multiplication by the weight coefficients, culminating in the output of a feature map $\tilde{X}$ endowed with channel attention.
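The SE block follows the original squeeze-and-excitation design [38]; the sketch below uses the customary two-layer bottleneck FC with a reduction ratio of 16, which is an assumption, since the paper does not state the ratio.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention (sketch)."""
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pooling -> (B, C)
        return x * w.view(b, c, 1, 1)     # excite: channel-wise reweighting
```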
The FCN Head demonstrates commendable performance in small target detection tasks, prompting its integration at the network's terminus to yield detection outcomes. As depicted in Figure 9, the FCN Head comprises a 3 × 3 convolutional layer followed by batch normalization and activation layers, along with a dropout layer and a 1 × 1 convolutional layer, ultimately yielding segmented result maps.
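A minimal sketch of an FCN Head matching this description is given below; the intermediate channel width (`in_channels // 4`) and the dropout probability are conventional values from common FCN head implementations, not figures given in the paper.

```python
import torch.nn as nn

def fcn_head(in_channels, num_classes, p_drop=0.1):
    # 3x3 conv -> BN -> ReLU -> dropout -> 1x1 conv, as in Figure 9.
    mid = in_channels // 4  # hypothetical intermediate width
    return nn.Sequential(
        nn.Conv2d(in_channels, mid, 3, padding=1),
        nn.BatchNorm2d(mid),
        nn.ReLU(inplace=True),
        nn.Dropout2d(p_drop),
        nn.Conv2d(mid, num_classes, 1),  # per-pixel class scores
    )
```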
3.6. Maritime-SIRST
The performance of deep learning algorithms is significantly influenced by the quality of the dataset [40]. Presently, existing IR small target datasets are primarily based on land and sky backgrounds, fail to accurately reflect real maritime scenarios, and are therefore unsuitable for training and evaluating detectors of ship targets at sea. Despite recent efforts to construct IR maritime ship target detection datasets such as ISDD and NUDT-SIRST-SEA, analysis reveals that these two datasets exhibit relatively singular scenes and cannot adequately represent the complexity of maritime backgrounds. Thus, the construction of a representative IR maritime ship detection dataset is crucial for this study. Such a dataset must provide authentic maritime backgrounds, encompassing diverse maritime conditions such as waves, islands, and cloud formations, to closely align with real-world applications. This dataset not only holds paramount significance for this study but also contributes to propelling the advancement of IR maritime ship detection. By employing datasets with authentic scenes, algorithms' performance in maritime ship detection tasks can be more accurately assessed, thereby enhancing algorithm robustness and reliability.
To meet the aforementioned requirements, this study leverages publicly available images from the near-infrared band of the Landsat-8 remote sensing satellite to construct a high-quality IR ship detection dataset tailored for complex maritime scenes, named Maritime-SIRST. Within the Landsat-8 remote sensing images, we selected IR band images from various regions worldwide, including ports, canals, and open seas across Asia, Africa, and North America. The selected images span from 2013 to 2021, ensuring temporal diversity. To enhance representativeness, images from different months and with varying cloud cover percentages were chosen. We utilized SenseTime's open-source tool, LabelBee, to annotate the original images, generating label files in mask format. Compared to other datasets, Maritime-SIRST is specifically tailored for IR maritime ships, employing authentic remote sensing satellite IR images with more diverse and complex backgrounds. The proposed Maritime-SIRST dataset comprises 1131 images, totaling 2647 targets, with each image sized at 256 × 256 pixels. As illustrated in Table 1, compared to the other two public datasets (ISDD, NUDT-SIRST-SEA), the Maritime-SIRST dataset exhibits the following characteristics:
More complex backgrounds: The proportion of images with complex backgrounds in the Maritime-SIRST dataset reaches 65.43%, surpassing that of NUDT-SIRST-SEA (approximately 54%) and ISDD (approximately 30%).
Diverse false alarm target types: Maritime-SIRST includes various false alarm targets such as wave clutter, complex cloud formations, islands, and ports, with a substantial proportion. In contrast, NUDT-SIRST-SEA lacks wave clutter background images, with over 80% featuring simple backgrounds or land near ports, while ISDD contains only about 10% of wave clutter and complex cloud formation backgrounds, with the rest featuring simple backgrounds and port island backgrounds.
Smaller targets: According to the definition by the Society of Photo-Optical Instrumentation Engineers (SPIE), small targets are those with an area of fewer than 80 pixels in a 256 × 256 image. In Maritime-SIRST, over 95% of images meet this definition, significantly higher than NUDT-SIRST-SEA (approximately 90%) and ISDD (approximately 1.5%).
Diverse target sizes: While meeting the criterion of small target size, targets in Maritime-SIRST vary in size from 0 to 80 pixels, demonstrating a more uniform distribution. In contrast, about 70% of targets in NUDT-SIRST-SEA are smaller than 20 pixels, while in ISDD, 98.5% of targets exceed 80 pixels, indicating a lack of representativeness.
More diverse target numbers: Images in Maritime-SIRST encompass scenarios with no targets, single targets, and multiple targets, whereas NUDT-SIRST-SEA consists solely of images with multiple targets, and ISDD lacks images with no targets. Thus, Maritime-SIRST more authentically reflects real maritime scenes.
4. Results
4.1. Experiment Settings
To validate the performance of our model, we selected three traditional algorithms, namely Top-hat [41], MPCM [42], and TTLDM [43], along with six deep learning algorithms, including U-Net [24], DNANet [29], MTUNet [37], ABCNet [30], AGPCNet [44], and LW-IRSTNet [45], for comparative experiments. All models were run on a computer equipped with a 15-vCPU AMD EPYC 7543 32-core processor and an NVIDIA GeForce RTX 3090 GPU using Python 3.8. Our model's training parameters were configured with a batch size of 16, 1000 epochs, an SGD optimizer with an initial learning rate of 0.05, and SoftIoULoss as the loss function; other training parameters were set as specified in the original papers.
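For reference, a common formulation of SoftIoULoss is sketched below; the paper cites the loss by name only, so the smoothing constant and the sigmoid applied to the logits are our assumptions.

```python
import torch

def soft_iou_loss(logits, target, eps=1e-6):
    """1 minus the soft (differentiable) intersection-over-union (sketch)."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()
```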
To verify the generalization performance of the model, we performed validation not only on our self-constructed dataset but also on a public dataset. The experiments utilized two datasets. The first is our self-constructed Maritime-SIRST dataset, comprising 1131 images captured from authentic maritime scenes in the IR spectrum, each sized at 256 × 256 pixels; 800 images were allocated for training and 331 for validation. The second is the publicly available NUDT-SIRST-SEA dataset, which includes 48 images sized at 10,000 × 10,000 pixels. Given the considerable computational resources and time required for training on these images and the presence of numerous low-quality images (e.g., predominantly landmasses, absence of targets), further processing was conducted: the original images were cropped into 256 × 256 tiles, low-quality images were removed, and 2000 images were retained, of which 1400 were assigned to the training set and 600 to validation. It is noteworthy that the ISDD dataset lacks semantic segmentation annotations and exhibits relatively simple backgrounds; hence, it was not utilized for comparative experiments.
4.2. Evaluation Metrics
In this article, Precision (Prec) [46], Recall (Rec) [46], mIoU [46], F1 [46], and AUC [46] are selected as the evaluation metrics for algorithm accuracy, and Params [45], FLOPs [45], and FPS [45] are selected as the evaluation metrics for computational complexity.

Prec, Rec, and F1 are commonly used metrics for assessing the accuracy of models in binary classification tasks and can be defined using a confusion matrix, as illustrated in Figure 10. Prec represents the proportion of true positives (TP) among all samples predicted as positive by the model. Rec indicates the proportion of true positives among all samples labeled as positive. The calculation formulas for Prec and Rec are as follows:

$$\mathrm{Prec} = \frac{TP}{TP + FP}, \qquad \mathrm{Rec} = \frac{TP}{TP + FN}$$
Here, TP represents the number of detected target pixels matching real target pixels, FP represents the number of background pixels mistakenly detected as real target pixels, and FN represents the number of target pixels mistakenly detected as background pixels [47,48,49].
Due to the threshold settings of classification models, Prec may reach 100% at high thresholds, while Rec may reach 100% at low thresholds, so either metric alone may be biased. To balance Prec and Rec, F1 is introduced, with the calculation formula as follows:

$$F1 = \frac{2 \times \mathrm{Prec} \times \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}}$$

This formula indicates that F1 will be high only when both Prec and Rec are high. Thus, the higher the F1, the more effective the model's performance [50,51].
mIoU, also known as the Jaccard similarity coefficient (JSC), refers to the mean intersection over union, the most commonly used semantic segmentation metric. In semantic segmentation, IoU represents the overlap between predicted and labeled pixel masks. The mean IoU (mIoU) is the arithmetic mean of the IoU values for each class and is used to measure pixel overlap across the entire dataset, with the calculation formula as follows:

$$\mathrm{mIoU} = \frac{1}{n} \sum_{i=1}^{n} \frac{TP_i}{TP_i + FP_i + FN_i}$$

Here, $n$ represents the total number of classes, and $TP_i$, $FP_i$, and $FN_i$ represent the true positives, false positives, and false negatives for the i-th class, respectively. For each class, its IoU is calculated; the IoU values for all classes are then summed and divided by the total number of classes to obtain the mean IoU.
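To make the pixel-level definitions above concrete, the following is an illustrative NumPy routine computing Prec, Rec, F1, and IoU for a single binary mask; the function and variable names are ours.

```python
import numpy as np

def pixel_metrics(pred, label, eps=1e-10):
    """Pixel-level Prec/Rec/F1/IoU for binary masks (illustrative)."""
    pred, label = pred.astype(bool), label.astype(bool)
    tp = np.logical_and(pred, label).sum()   # correctly detected target pixels
    fp = np.logical_and(pred, ~label).sum()  # background detected as target
    fn = np.logical_and(~pred, label).sum()  # target detected as background
    prec = tp / (tp + fp + eps)
    rec = tp / (tp + fn + eps)
    f1 = 2 * prec * rec / (prec + rec + eps)
    iou = tp / (tp + fp + fn + eps)
    return prec, rec, f1, iou
```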
ROC curve stands for receiver operating characteristic curve and is a commonly used tool for evaluating the performance of binary classification models. The ROC curve is plotted with a true positive rate on the vertical axis and a false positive rate on the horizontal axis. It illustrates the model’s classification performance for positive and negative instances at different thresholds. The area under the ROC curve (AUC) is widely used as a metric for evaluating model performance, with a value closer to 1 indicating better model performance.
Params usually refer to the number of learnable parameters in the model, including each value in the weight matrix in the convolution layer, fully connected layer, etc. It directly determines the size of the model and affects the amount of memory occupied during inference. Note that the number of parameters in the model is not the same as the size of the storage space.
FLOPs refers to the number of floating-point operations, understood as computational amount (computational time complexity); it can be used to measure the complexity of an algorithm and is a common indirect standard for measuring the speed of a neural network model. Although some recent studies have shown that relying on FLOPs alone to evaluate model speed may not be reliable, because the computing speed of a model is also affected by factors such as memory throughput, FLOPs is still widely used as a reference standard for evaluating model speed. In computing, one multiplication or one addition is counted as one FLOP. The computational cost of a conventional convolution layer (multiply-add operations) is

$$\mathrm{FLOPs} = 2 \times H \times W \times C \times k^2 \times N$$

where H, W, and C are, respectively, the height, width, and number of channels of the input, N is the number of channels of the output (that is, the number of filters), and k is the size of the convolution kernel.
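As an illustrative calculation with hypothetical layer sizes, a 3 × 3 convolution (k = 3) with C = N = 32 channels applied to a 256 × 256 input costs 2 × 256 × 256 × 32 × 3² × 32 ≈ 1.21 × 10⁹ FLOPs, i.e., roughly 1.2 GFLOPs for that single layer.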
FPS refers to the number of frames processed per second and is calculated by

$$\mathrm{FPS} = \frac{1}{t}$$

where t represents the time required for the network to predict one image, excluding post-processing time. FPS is the most direct indicator of the efficiency of an algorithm.
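A minimal timing sketch for estimating FPS on a GPU under this definition follows; the warm-up loop and CUDA synchronization are standard measurement practice we add for stable results, not steps prescribed by the paper.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_shape=(1, 1, 256, 256), runs=100, device="cuda"):
    """Estimate frames per second on a single image, excluding post-processing."""
    model = model.eval().to(device)
    x = torch.randn(input_shape, device=device)
    for _ in range(10):                  # warm-up passes to stabilize clocks/caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()         # ensure pending GPU work is finished
    start = time.time()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return runs / (time.time() - start)
```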
4.3. Quantitative Results
Table 2 presents a comparative analysis of experimental data between FCNet and other notable algorithms on the Maritime-SIRST dataset. From the data in the table, it is evident that on the Maritime-SIRST dataset, FCNet demonstrates a 3.59% improvement in Prec compared to the other algorithms, a 1.42% increase in Rec, a 4.05% enhancement in mIoU, and a 2.43% advancement in F1. Notably, FCNet's AUC is only slightly lower than that of the MTUNet algorithm. Overall, FCNet's precision metrics surpass those of the other algorithms, yielding promising outcomes.
As can be seen from Table 2, although traditional algorithms such as Top-hat, MPCM, and TTLDM can detect targets, they produce large numbers of false alarms and missed detections. These model-driven algorithms can only identify targets against simple backgrounds that match their prior assumptions and perform poorly against complex backgrounds such as the sea surface, leading to high false alarm and missed detection rates. As the most basic deep learning algorithm, UNet has a simple structure and therefore also suffers from high false alarm and missed detection rates. DNANet benefits from its densely nested attention mechanism and achieves good results, but some features are lost during repeated downsampling and feature fusion, so it tends to miss dim small targets. With the introduction of ViT, the MTUNet algorithm can obtain global semantic information, but when the target is too small, the global semantics contribute little, so it produces many false alarms and missed detections against complex backgrounds. Both ABCNet and AGPCNet focus on integrating global and local semantics and achieve good results; however, due to their fixed receptive fields, some details are still lost during downsampling, and dim small targets are easily missed. LW-IRSTNet focuses on a lightweight model structure, so some details are lost, resulting in high false alarm and missed detection rates. FCNet obtains a flexible receptive field by integrating multiple convolutions, greatly reducing feature loss during encoding and decoding, so that background and targets can be accurately separated and the false alarm and missed detection rates reduced.
FCNet's Params, FLOPs, and FPS lag behind lightweight algorithms such as LW-IRSTNet but remain on a similar level to the other deep learning algorithms. Given its substantial accuracy gains over the other models, FCNet offers the best overall trade-off across all metrics.
Table 3 illustrates a comparison of experimental results between FCNet and other algorithms on the NUDT-SIRST-SEA dataset. As depicted in the table, FCNet exhibits a 1.51% improvement in Prec compared to the other algorithms, a 0.65% increase in Rec, a 1.27% enhancement in mIoU, and a 1.02% advancement in F1, and it also demonstrates commendable performance in terms of AUC.
Comparing the experimental results across the two datasets, FCNet and the other algorithms exhibit similar trends on both. FCNet thus performs consistently well on different datasets, demonstrating good generalization.
Furthermore, experimental data from the Maritime-SIRST and NUDT-SIRST-SEA datasets were utilized to construct ROC curves, as illustrated in Figure 11. It can be observed that the curve for FCNet closely approaches the top-left corner, trailing only slightly behind MTUNet, indicating that the precision of our model operates at a relatively high level compared to the other models and algorithms.
4.4. Visual Results
Figure 12 illustrates the comparative performance of FCNet and other algorithms on the Maritime-SIRST dataset, and Figure 13 displays the corresponding comparison on the NUDT-SIRST-SEA dataset. The red boxes represent accurately detected targets, blue boxes denote false alarms, and green boxes signify missed targets. From the detection results, it is evident that traditional algorithms exhibit poor performance, characterized by high rates of missed and false detections. Moreover, some deep learning algorithms also suffer from high false alarm rates and missed detections, with significant disparities between segmented targets and reality. This is primarily attributed to the complexity of maritime scenes, where phenomena such as cloud clusters, waves, and islands are prone to misidentification as targets, while dim small ship targets are easily submerged in the background, leading to missed detections. Overall, our algorithm exhibits lower false alarm and missed detection rates, and from the segmentation results, FCNet's performance is the most outstanding. Therefore, the proposed FCNet algorithm is better suited for IR small ship detection in maritime environments.
4.5. Ablation Study
To investigate the impact of the number of encoder-decoder layers on IR small target detection performance, this study conducted ablation experiments on UNet variants with different numbers of encoder-decoder layers using the Maritime-SIRST dataset. As shown in Table 4, the 2-layer UNet achieves the best overall precision while greatly reducing the parameter count and computational cost compared with the 4-layer UNet. Although the 1-layer UNet reduces the parameters and computation even further, its accuracy drops substantially. Therefore, FCNet adopts the 2-layer encoder-decoder as its basic structure.
Subsequently, we performed ablation experiments on the introduced dilated convolutions (DC), deformable convolutions (DCN), and SE blocks in the model. From the data in Table 5, it is evident that the incorporation of DC, DCN, and SE significantly enhances the model's performance over the base network. Compared to the base network, FCNet demonstrates a 4.75% increase in Prec, a 2.82% increase in Rec, a 6.11% increase in mIoU, a 3.72% increase in F1, and a 1.02% increase in AUC.
To obtain a more intuitive understanding of the roles played by DC, DCN, and SE in the network, we visualized the feature maps at each stage of the FCNet network model. As shown in Figure 14, no-DCN denotes the FCNet model without the DCN module, no-DC the model without DC, and no-SE the model without SE. From Figure 14, it can be seen that the network without DCN loses substantial edge information near the target; the network without DC lacks contextual semantic information in the encoding stage; and the network without SE has chaotic channel information in its feature maps, with insufficiently prominent features near the target.
5. Conclusions
This paper focuses on the design of a high-precision model, FCNet, for IR small ship detection on the sea surface, considering the small size of such ships, the lack of color and texture information, and the complexity of the maritime background. Specifically, to address the challenge of small ship size, a two-layer encoder-decoder structure is employed to prevent the loss of small ships during downsampling. Subsequently, an FEM is proposed to enrich features by introducing dilated convolutions and deformable convolutions. Furthermore, a context fusion module (CFM) is introduced to fuse multiscale contextual information and local information to obtain richer semantic information. Then, a semantic fusion module (SFM) is proposed to integrate low-level features with deep semantic information. Finally, an attention module, SE, is incorporated into the decoding layer to adaptively learn channel weights and obtain more effective channel information. Experimental results on two datasets demonstrate that FCNet outperforms other algorithms in terms of accuracy metrics. Additionally, ablation experiments further verify the effectiveness of the added modules. Moreover, a dataset focusing on IR small ships on the sea surface, named Maritime-SIRST, is proposed. This dataset encompasses various complex scenarios in maritime backgrounds, such as waves, complex cloud clusters, and port islands, closely resembling real-world applications. The dataset not only meets the research requirements of this study but also fosters further development in this field.
While this study concentrates on enhancing the precision of IR small ship detection at sea, the introduction of multiple modules to improve network accuracy also increases algorithm complexity. It is noteworthy that IR small ship detection algorithms are typically deployed on resource-constrained embedded devices, such as satellites and drones. Due to the limited memory and computing power of these devices, strict requirements are imposed on the parameter size and computational complexity of the algorithm model. To meet the needs of practical engineering applications, future directions should focus on the lightweight design of the model while maintaining accuracy, aiming to reduce the parameter size and computational load.