Article

Development and Application of an Intelligent Recognition System for Polar Environmental Targets Based on the YOLO Algorithm

1 Navigation College, Dalian Maritime University, Dalian 116026, China
2 Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Zhuhai 519082, China
3 College of Environmental Science and Engineering, Dalian Maritime University, Dalian 116026, China
4 Panjin Maritime Safety Administration, Panjin 124211, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(12), 2313; https://doi.org/10.3390/jmse13122313
Submission received: 2 November 2025 / Revised: 24 November 2025 / Accepted: 2 December 2025 / Published: 5 December 2025

Abstract

As global climate warming enhances the navigability of Arctic routes, their navigation value has become prominent, yet ships operating in ice-covered waters face severe threats from sea ice and icebergs. Existing manual observation and radar monitoring remain limited, highlighting an urgent need for efficient target recognition technology. This study focuses on polar environmental target detection by constructing a polar dataset of 1342 JPG images covering four classes (sea ice, icebergs, ice channels, and ships), obtained via web collection and video frame extraction. A “Grounding DINO pre-annotation + LabelImg manual fine-tuning” strategy is employed to improve annotation efficiency and accuracy, with data augmentation further enhancing dataset diversity. After comparing YOLOv5n, YOLOv8n, and YOLOv11n, YOLOv8n is selected as the baseline model and improved by introducing the CBAM/SE attention mechanisms, SCConv/AKConv convolutions, and the BiFPN network. Among these models, the improved YOLOv8n + SCConv achieves the best performance in polar target detection, with a mean average precision (mAP) of 0.844, which is 1.4% higher than that of the original model. It effectively reduces missed detections of sea ice and icebergs, thereby enhancing adaptability to complex polar environments. The experimental results demonstrate that the improved model exhibits good robustness on images of varying resolutions, scenes with water surface reflections, and AI-generated images. In addition, a visual GUI with image/video detection functions was developed to support real-time monitoring and result visualization. This research provides essential technical support for safe navigation in ice-covered waters, polar resource exploration, and scientific activities.

1. Introduction

Global climate warming has enhanced the navigability of Arctic waterways, which exhibit significant advantages in terms of geopolitics, economy, and resource acquisition. However, ships navigating in ice-covered waters face substantial threats from icebergs and sea ice. Traditional manual observation and radar monitoring have inherent limitations, creating an urgent need for reliable target recognition methods.
In the context of polar environmental target detection, high-quality datasets and efficient detection methods are of critical importance. They help capture the characteristics of polar targets, provide references for navigation decision making in ice zones, and enhance navigation efficiency and safety. Target detection technology can accurately identify obstacles, support route planning, and provide assistance for polar resource development, scientific research, and ecological protection [1,2]. Additionally, the integration of unmanned aerial vehicles (UAVs) with image recognition technology can further ensure the safety of ships navigating in polar regions [3].
As shown in Figure 1, the dataset constructed in this paper includes four types of polar targets: sea ice, icebergs, ice channels, and ships.

1.1. Research Status of Polar Target Detection

At present, some countries initiated research on automatic target detection early and have achieved remarkable technical results, but few studies have applied deep learning-based target detection algorithms (e.g., the evolving YOLO series [3]) directly to polar sea ice recognition. The progress of these algorithms in speed, accuracy, and efficiency provides valuable references for polar applications.
In polar glacier detection, research abroad has focused on glaciers: optical/infrared remote sensing combined with field measurements is used to estimate glacier parameters [4], while ASTER satellite stereo imaging [5] and ICESat GLAS laser altimetry [6] monitor volume changes. Sungwook Hong [7] derived sea ice’s small-scale roughness (0.25–0.5 cm) and refractive index (1.6–1.8 in winter; 1.2–1.4 in summer) from AMSR-E data for sea ice–seawater differentiation. M. Gupta et al. [6] found that active microwave data outperform passive data in classifying Arctic sea ice surface roughness. A.D. Fraser et al. [5] used MODIS image compositing to detect Antarctic fast ice, and Y. Han et al. [8] proposed a hyperspectral method that achieved 91.18% (Baffin Bay) and 94.22% (Bohai Bay) accuracy with an SVM. H. Zhou [9] improved V-shaped ship wake detection in SAR images with a combined filtering–Radon transform method.
Domestic research started later but has made steady progress. Studies on Qinghai–Tibet Plateau glacier detection provide technical support (e.g., [4]). The U-PSP-Net (an improved U-Net) of Zhang Daqi et al. [4] enhances multi-scale feature extraction for shadowed glacier recognition, improving MPA and IoU. Zheng Minwei et al. [10,11] developed SVM-based methods using Gaofen-3/Sentinel-1 SAR data, achieving over 90% accuracy in sea ice–seawater separation. Zhao Chaofang et al. [12] compared algorithms (Bayesian, SVM, etc.) with HY-2A/SCAT data, finding that Bayesian algorithms produce the fewest misjudgments. Xu Changjing [13] used multi-source scatterometer data to generate daily sea ice coverage maps (2019–2021) and Arctic winter sea ice type maps (2019–2022).
By now, many studies have been carried out on glacier and sea ice monitoring, and they have used optical remote sensing, infrared remote sensing, laser altimetry, and passive or active microwave data to analyze glacier volume changes, sea ice roughness, and sea ice distribution. However, these studies have usually focused on a single type of target and have lacked unified detection of multiple polar targets, including icebergs, sea ice, ice channels, and ships. In addition, existing polar environmental target detection methods are limited by the lack of data and low detection accuracy, which makes it difficult to meet the safety and efficiency requirements of polar navigation. Polar environmental datasets often contain only a small number of samples and lack diversity, which cannot support deep learning models in learning the complexity of polar scenes. Although progress has been made in sea ice recognition, there are still few studies that directly apply deep learning object detection algorithms to polar sea ice, and the use of YOLO for automatic detection of polar environmental targets is even more limited. As a result, there is still no lightweight and real-time detection model, together with a practical system, that is designed for polar scenes and suitable for shipborne or UAV images. This is the research gap that this study aims to address.

1.2. Significance

This study addresses the scarcity of polar environmental target data and the limited recognition capability of traditional monitoring methods that rely on manual observation and marine radar. A polar environmental target dataset covering icebergs, sea ice, ice channels, and ships is constructed, and an improved YOLOv8n detection model, together with an intelligent recognition system, is developed. The proposed image-based method provides a more intuitive and automated way for ships to detect targets in complex ice-covered waters, which helps identify potential hazards earlier and improves the safety and decision-making efficiency of polar navigation.
At the academic level, this work provides an integrated framework that links dataset construction, model optimization, and system implementation for polar environmental target detection. The study enriches data resources for polar targets and explores deep learning-based target detection methods that are adapted to polar scenes. At the engineering level, the improved detection model and the visual recognition system can support image- and video-based monitoring, offering technical support for safe navigation, polar resource exploration, and scientific investigation in ice-covered waters.

1.3. Research Objectives and Paper Organization

Building on the above background and significance, this study focuses on three main tasks: constructing a polar environmental target dataset, improving a YOLO-based detection model, and developing an intelligent recognition system. The goal is to provide a lightweight and real-time polar target detection solution that can be applied to complex polar scenes.
Specifically, the objectives are as follows. The first objective is to construct a dataset with diverse sources and accurate annotations that can support model training in polar environmental target detection. The second objective is to explore an efficient pre-annotation strategy based on Grounding DINO, combined with manual correction using LabelImg, in order to reduce the manual labeling workload while maintaining annotation quality. The third objective is to introduce attention mechanisms, convolutional structures, and a BiFPN feature fusion module into the YOLOv8n framework, so as to enhance the ability of the model to detect multi-scale targets in complex polar backgrounds. The fourth objective is to develop a visual recognition system with a graphical user interface that integrates image detection, video detection, and real-time monitoring, providing more reliable visual detection tools for polar navigation and scientific observation.
The remainder of this paper is organized as follows. Section 2 reviews convolutional neural networks and typical object detection algorithms, providing the theoretical basis for model selection. Section 3 describes the construction of the polar target dataset, including image acquisition, pre-annotation, manual refinement, and data augmentation strategies. Section 4 presents polar environmental target detection experiments based on YOLO models, compares the performance of different YOLO versions, and proposes several improvements to enhance detection accuracy and stability. Section 5 shows the inference results of the improved model and introduces the visual recognition system to demonstrate its application. Finally, Section 6 summarizes the main findings of this study and outlines future research directions.

2. Related Work and Methodological Background

As shown in Figure 2, the CBAM/SE attention mechanisms are introduced. The CBAM (convolutional block attention module) combines channel attention and spatial attention [14]. For an input feature map F of size H × W × C, channel attention first performs global average pooling and max pooling to obtain two feature maps, A and B, of size 1 × 1 × C. After fully connected layers and sigmoid activation, a weight coefficient, Mc, is obtained. Multiplying Mc by F produces a feature map, F1. Then, F1 is input to spatial attention, where average pooling, max pooling, concatenation, and convolution are performed to obtain a feature map of size H × W × 1. After a convolution and sigmoid activation, a feature map, Ms, of size H × W × 1 is obtained. Finally, multiplying Ms by F1 yields a new feature map of size H × W × C. The SE (squeeze-and-excitation) attention mechanism globally average-pools the input feature map, reducing it to 1 × 1 × C. After fully connected layers and sigmoid activation, a weight is obtained and multiplied by the input to produce the output. The number of neurons in the fully connected layer matches the number of input channels to ensure the channel count remains unchanged [14].
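As a concrete illustration of this channel-weighting idea, the following is a minimal PyTorch sketch of the SE branch described above; matching the description, the fully connected layer keeps the channel count unchanged (practical implementations often add a bottleneck, which is omitted here).

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """SE attention as described above: global average pooling, a fully
    connected layer whose width matches the channel count, sigmoid, then
    channel-wise reweighting of the input feature map."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: H x W x C -> 1 x 1 x C
        self.fc = nn.Linear(channels, channels)
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.gate(self.fc(self.pool(x).view(b, c))).view(b, c, 1, 1)
        return x * w                          # output keeps size H x W x C

x = torch.randn(2, 64, 32, 32)
assert SEBlock(64)(x).shape == x.shape        # channel count is unchanged
```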
Introducing SCConv convolution: SCConv includes channel reconstruction (CRU) and spatial reconstruction (SRU) modules. CRU uses a split–transform–merge strategy to reduce channel redundancy; SRU first normalizes the input features, then uses Sigmoid to generate weights to separate features, and finally transforms and reorganizes to output spatially refined features. When CRU processes the input feature Xin, it first splits Xin into two parts, which are convolved with different kernel sizes of 1 × 1 to obtain Xsp and Xgp. Then, through group-wise convolution (GWC) and point-wise convolution (PWC) transformations, Y1 and Y2 are obtained. After pooling and softmax weighted fusion, a channel-refined feature, Y, is obtained, which can improve feature representation efficiency and reduce model parameters and computational costs.
Figure 3 illustrates the AKConv convolution. Traditional convolution operations are limited by the local window and can only obtain local information within the current window; the size and shape of the convolution kernel are fixed, and the number of parameters grows quadratically with kernel size (e.g., 4 × 4 and 5 × 5 square grids), making it difficult to adapt to changing targets. AKConv provides a flexible, deformable-style mechanism, supporting arbitrary sampling shapes and arbitrary numbers of kernel parameters (e.g., 5, 7, or 11). It also introduces a new coordinate generation algorithm, using an irregular convolution defined by an initial position plus learned offsets to perform efficient feature extraction, helping to balance network overhead and performance.
Fusing the BiFPN network: The bi-directional feature pyramid network (BiFPN) was proposed in the EfficientDet model. Figure 4 shows that the BiFPN introduces bidirectional flow feature information, fusing upsampled and downsampled feature maps layer by layer. It also introduces horizontal and vertical links, allowing better fusion of features of different scales. The BiFPN is an improvement of the PAN structure. Compared with the original network structure, it adds an additional path between input and output nodes at the same level, enhancing the network’s information extraction ability and, thus, improving the robustness and accuracy of target detection [15].
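The core operation behind this layer-by-layer fusion is the fast normalized weighted fusion introduced with BiFPN in EfficientDet [15]. The snippet below is a minimal PyTorch illustration of that single operation, not the full BiFPN topology: each input path gets a learnable non-negative weight, and the weights are normalized before summing.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion of n same-shape feature maps."""
    def __init__(self, n_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))  # one learnable weight per input path
        self.eps = eps

    def forward(self, feats):
        w = torch.relu(self.w)                       # keep weights non-negative
        w = w / (w.sum() + self.eps)                 # normalize so weights sum to ~1
        return sum(wi * f for wi, f in zip(w, feats))

# e.g., fuse a lateral input with an upsampled top-down feature of the same shape
fuse = WeightedFusion(2)
out = fuse([torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40)])
print(out.shape)  # torch.Size([1, 64, 40, 40])
```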

3. Research on Target Detection Algorithms Based on Convolutional Neural Networks

3.1. Evaluation Metrics for Detection Performance

To comprehensively measure the detection performance and real-time capability of the proposed algorithm, three core metrics are adopted in this study: average precision (AP), mean average precision (mAP), and frames per second (FPS). These metrics quantify the algorithm’s performance from the perspectives of detection quality and operational efficiency.
Recall (R) reflects the model’s ability to identify positive samples, with its calculation method referenced in [16]. Precision (P) indicates the proportion of actual positive samples among the predicted positive ones, whose computation is detailed in [16].
Average precision (AP) measures the detection performance for a single category, derived from the area under the precision–recall (P-R) curve, as defined in [17]. Mean average precision (mAP) synthesizes the detection performance across multiple categories, calculated as the average of AP values for all categories [17]. Two common variants are used in practical applications: mAP0.5 (IoU threshold = 0.5) and mAP0.5:0.95 (IoU thresholds from 0.5 to 0.95, with a step of 0.05), both referenced in [17].
FPS is a key indicator of the model’s real-time performance, representing the number of frames processed per second, and its calculation is based on [18].
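For reference, the standard definitions behind these metrics, as given in [16,17,18], can be written as

$$ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad AP = \int_{0}^{1} p(r)\,dr, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i, \qquad FPS = \frac{N_{\text{frames}}}{t_{\text{total}}} $$

where $p(r)$ is the precision at recall $r$ on the P-R curve and $N$ is the number of target categories.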
These metrics originate from established works in object detection (e.g., [16,17,18]), enabling systematic quantification of the proposed algorithm’s “detection accuracy” and “real-time operational efficiency”. They provide a unified and objective measurement standard for the comparative analysis of subsequent experimental results.

3.2. Comparison of Target Detection Algorithms

To comprehensively evaluate the performance of YOLOv8n, this section conducts a comparative analysis with three representative target detection algorithms: Faster R-CNN [19], SSD [20], and RetinaNet. The qualitative conclusions summarized in Table 1 are derived from a systematic comparison of extensive benchmark data from the official YOLO documentation, including FPS, mAP, and model parameters across multiple datasets, combined with a comprehensive review of YOLO architectures [3] that integrates official YOLO performance metrics and cross-algorithm comparative results. Given the substantial scale of the original performance data provided on the official platform, the table presents condensed comparative results focusing on core performance dimensions to ensure clarity and readability.
As observed from Table 1, the single-stage detection architecture endows YOLOv8n with inherent advantages in detection speed and memory efficiency, key requirements for polar target detection. Polar detection relies on shipborne or UAV-based embedded devices with limited computing power and strict resource constraints [1], and ice floes and icebergs in polar regions are dynamic, requiring real-time response to support navigation and collision avoidance [21]. Faster R-CNN, a typical two-stage detector, achieves higher accuracy but has slower real-time performance and larger memory usage, making it unsuitable for resource-constrained polar scenarios. Compared with SSD [20], YOLOv8n outperforms in both speed and memory footprint. While SSD has marginally higher accuracy, YOLOv8n’s balanced performance is more aligned with polar environmental target detection demands, where real-time response and hardware resource limitations are critical. RetinaNet excels at small-target detection but suffers from slow inference and large memory consumption, failing to meet the real-time monitoring needs of polar ice conditions. In contrast, YOLOv8n maintains speed and memory advantages while ensuring acceptable accuracy, addressing the challenge of coexisting large icebergs and small broken ice in polar scenes [3].
Considering the requirements of polar environmental target detection—real-time performance, accuracy, and adaptability to resource-constrained devices [1,21]—YOLOv8n is selected as the core algorithm. This choice is justified by its advanced network structure: it adopts the C2f module and efficient feature fusion to adapt to complex polar scenarios; the Anchor-Free detection head simplifies the structure, accelerates inference speed, and enhances small-target detection while reducing computational cost and memory usage; it optimizes bounding-box regression with CIoU loss and DFL loss, and uses binary cross-entropy for classification to strengthen detection robustness; and the neck module adopts the PAN concept for multi-scale feature fusion, improving the representation of targets at different scales [3].
In summary, YOLOv8n demonstrates excellent performance in network structure, detection speed, accuracy, and memory usage, which are well-matched to the resource constraints and real-time detection needs of polar environmental target tasks [3]. Therefore, this paper selects YOLOv8n as the core algorithm and proceeds with further in-depth improvements and optimizations.

3.3. ACON Activation Function Family

To address the limitations of ReLU-series activation functions (e.g., slow convergence in the negative half-axis, non-smoothness at the zero position, and lack of adaptive activation capability in SiLU), the ACON (activate or not) activation function family is adopted in this study. Its core design lies in adaptively switching between linear and nonlinear modes to dynamically determine the activation state of neurons, as detailed in [22].
ACON is grounded in the approximate smoothing of the standard maximization function, with its mathematical expression given by
$$ S_\beta(x_1, \ldots, x_n) = \frac{\sum_{i=1}^{n} x_i e^{\beta x_i}}{\sum_{i=1}^{n} e^{\beta x_i}} $$
where $n$ denotes the dimension of the input vector, $x_i$ represents the $i$-th input element, and $\beta$ is the switching factor. As $\beta \to \infty$, $S_\beta$ approaches the max operation (nonlinear activation mode); as $\beta \to 0$, $S_\beta$ approaches the arithmetic mean (linear deactivation mode), thereby adaptively governing the activation state of neurons.
Among the ACON family’s formulations, ACON-A (an approximate smoothing of ReLU) is defined as
$$ S_\beta(\eta_a(x), \eta_b(x)) = S_\beta(x, 0) = x \cdot \sigma(\beta x) $$
where $\eta_a(x) = x$, $\eta_b(x) = 0$, and $\sigma$ is the sigmoid function. These core formulations are derived from the research results reported in [22], enabling ACON to adaptively balance linear and nonlinear activation and effectively address the drawbacks of traditional ReLU-series functions. For detailed derivations and extended applications of the ACON family, refer to the original study [22].
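As a concrete illustration, a minimal PyTorch sketch of ACON-A follows. Keeping $\beta$ as a single learnable scalar is a simplification for illustration; the full ACON family in [22] also learns the linear bounds $\eta_a$ and $\eta_b$ and can generate $\beta$ per channel.

```python
import torch
import torch.nn as nn

class ACONA(nn.Module):
    """ACON-A: S_beta(x, 0) = x * sigmoid(beta * x), a smooth approximation of ReLU.
    beta -> infinity recovers max(x, 0); beta -> 0 approaches the mean of (x, 0)."""
    def __init__(self, beta: float = 1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta))  # learnable switching factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)

act = ACONA()
print(act(torch.linspace(-3.0, 3.0, 7)))  # smooth, non-zero gradient on the negative axis
```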

4. Construction of an Automatic Polar Target Dataset Based on Prompts

4.1. Acquisition of Polar Target Dataset

In research on polar environmental target detection, constructing a high-quality dataset is a crucial foundation. The primary requirement for the dataset in this study is authenticity; no specific constraints are imposed on resolution, environmental conditions, or other aspects. Since no open-source dataset meets the requirements of this study, images were collected through two methods: the first was to use a search engine to gather polar-related images from public websites (as shown in Figure 5a,b, covering scenarios such as icebergs and sea ice in different forms), yielding 998 JPG images; the second was to extract frames from polar videos (as shown in Figure 5c,d, including views of polar ship navigation and ice distribution), yielding 344 JPG images. The two sources total 1342 images, covering targets such as icebergs, sea ice, ice channels, and polar ships, providing data support for model training [6,7,21].
Polar environmental data are difficult to acquire, and datasets often have few samples and lack diversity. To address this, traditional data augmentation methods are used for optimization [5,8].
As shown in Table 2, the dataset covers four target categories (sea ice, ice channels, icebergs, and ships), with a total of 1342 labeled images. The dataset in this paper is divided in an 8:1:1 ratio, resulting in 1074 training set images, 134 test set images, and 134 validation set images. The training set is used for feature learning; the test set assesses generalization ability; and the validation set tunes and optimizes the model to improve performance stability [9].
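A minimal sketch of such an 8:1:1 split is shown below; the directory layout and file extension are illustrative assumptions, not specifics from the paper.

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, seed: int = 0):
    """Shuffle and split images into train/test/val subsets at an 8:1:1 ratio."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)        # fixed seed for a reproducible split
    n_train = round(0.8 * len(images))
    n_test = round(0.1 * len(images))
    train = images[:n_train]
    test = images[n_train:n_train + n_test]
    val = images[n_train + n_test:]
    return train, test, val

train, test, val = split_dataset("polar_dataset/images")
print(len(train), len(test), len(val))         # 1074 134 134 for the 1342-image dataset
```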
The dataset has many notable characteristics and advantages. On the one hand, it has a diverse and representative source, covering typical scenarios and targets [4,10]. On the other hand, the images contain multiple types of targets, and through annotation and division, their usability is ensured, laying a foundation for model training and algorithm optimization [11].

4.2. Data Pre-Annotation Strategy Based on Grounding DINO

Current training relies heavily on manual annotation. This paper adopts the Grounding DINO pre-annotation strategy to reduce manual effort and improve efficiency [23].
Grounding DINO is an innovative open-set target detection model. Its architecture includes a visual encoder (using Vision Transformer to extract image features), a text encoder (converting text prompts into feature vectors), a feature enhancer (fusing multi-modal features), a language-guided query selection module (selecting relevant features), and a cross-modal decoder (processing queries and predicting target bounding boxes) [23].
As shown in Figure 6, pre-annotation proceeds as follows: first, the pre-trained Grounding DINO is loaded, and text prompts (e.g., “iceberg, sea ice, channel, ship”) are prepared. The encoders extract image and text features; after feature enhancement, query selection, and cross-modal decoding, target bounding boxes are predicted and preliminary annotations are generated, laying the foundation for manual refinement [23]. The pseudocode is shown in Table 3 (Pseudocode for Data Pre-annotation Strategy Based on Grounding DINO), which outlines the steps: load the Grounding DINO model with pre-trained weights; load the JPEG dataset; set the text prompt “iceberg, sea ice, channel, ship”; then, for each image, extract visual and text features, fuse them in the feature enhancer, select language-guided queries, decode bounding boxes with the cross-modal decoder, draw the boxes, and save the annotated data. A minimal sketch of this loop is given below.
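The sketch assumes the inference utilities of the open-source Grounding DINO repository; the config/weight paths, thresholds, and the period-separated prompt form (the library's convention) are illustrative assumptions rather than the paper's exact settings.

```python
from pathlib import Path
from groundingdino.util.inference import load_model, load_image, predict

model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
TEXT_PROMPT = "iceberg . sea ice . channel . ship"   # categories separated by periods
CLASS_IDS = {"iceberg": 0, "sea ice": 1, "channel": 2, "ship": 3}

for img_path in Path("polar_dataset/images").glob("*.jpg"):
    image_source, image = load_image(str(img_path))
    # Returned boxes are normalized (cx, cy, w, h), which matches the YOLO label format.
    boxes, logits, phrases = predict(model=model, image=image, caption=TEXT_PROMPT,
                                     box_threshold=0.35, text_threshold=0.25)
    lines = [f"{CLASS_IDS[p]} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
             for p, (cx, cy, w, h) in zip(phrases, boxes.tolist()) if p in CLASS_IDS]
    # Write YOLO-format pre-annotations next to each image for later LabelImg refinement.
    img_path.with_suffix(".txt").write_text("\n".join(lines))
```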
After the automatic pre-annotation process, manual refinement is conducted using the LabelImg tool to ensure the accuracy of annotations. The refinement standards are defined as follows: (1) Bounding box position: Each bounding box must accurately enclose the target, without offset, omission of target parts, or inclusion of irrelevant regions. (2) Bounding box size: It should match the actual dimensions of the target, avoiding being excessively large (including redundant background) or too small (failing to fully cover the target). (3) Category annotation: Correct any mislabeled categories among “iceberg”, “sea ice”, “channel”, and “ship” to ensure precise classification [24].
This combined strategy of “Grounding DINO automatic annotation + manual refinement” achieves a balance between efficiency and accuracy in polar target dataset construction, thereby providing a robust data foundation for subsequent model training in polar environmental object detection [23,24].

4.3. Annotation Refinement and Dataset Augmentation Strategies

To enhance the accuracy, quality, and diversity of data for polar environmental target detection, an annotation refinement strategy and a dataset augmentation strategy are adopted. For annotation refinement, a “pre-annotation + manual adjustment” approach is used: the Grounding DINO model, combined with text prompts, preliminarily localizes targets to generate preliminary annotations, which are then manually corrected via the LabelImg tool to eliminate biases and ensure alignment with model training requirements [23,24]. For dataset augmentation, a hybrid strategy is implemented: first, Grounding DINO conducts target detection on generated images as well as images and videos from the polar target dataset, to filter out content containing relevant targets [23]; subsequently, traditional methods such as rotation, cropping, random scaling, Gaussian noise addition, contrast/brightness adjustment, Gaussian blur, and color enhancement are applied to the filtered data to generate synthetic images suitable for polar detection tasks [5,8]. Generated images can also undergo secondary filtering by Grounding DINO (guided by text prompts) to retain those with task-relevant targets [23]. These traditional augmentation methods simulate changes in actual scenarios, improving the model’s robustness and accuracy [5,8]. In summary, this hybrid strategy integrates Grounding DINO-based automatic annotation/filtering with traditional data augmentation to improve the diversity and quality of the polar target dataset [23,24].
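As an illustration, the listed traditional augmentations could be composed with the albumentations library roughly as follows. Parameter values and file paths are illustrative assumptions; because the bounding boxes are passed in YOLO format, they are transformed consistently with the image.

```python
import cv2
import albumentations as A

transform = A.Compose(
    [
        A.Rotate(limit=15, p=0.5),                                # rotation
        A.RandomScale(scale_limit=0.2, p=0.3),                    # random scaling
        A.RandomSizedBBoxSafeCrop(height=640, width=640, p=0.3),  # cropping that keeps targets
        A.GaussNoise(p=0.3),                                      # Gaussian noise addition
        A.RandomBrightnessContrast(p=0.5),                        # contrast/brightness adjustment
        A.GaussianBlur(blur_limit=(3, 7), p=0.2),                 # Gaussian blur
        A.ColorJitter(p=0.3),                                     # color enhancement
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

image = cv2.imread("polar_dataset/images/sample.jpg")             # illustrative path
bboxes = [(0.50, 0.45, 0.20, 0.30)]                               # one YOLO box (cx, cy, w, h)
class_labels = ["iceberg"]
out = transform(image=image, bboxes=bboxes, class_labels=class_labels)
aug_image, aug_bboxes = out["image"], out["bboxes"]
```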

5. Detecting Polar Environmental Targets Using the YOLO Algorithm

This paper uses three YOLO models (YOLOv5n [25], YOLOv8n [3], and YOLOv11n [3]) to compare their performance in polar environmental target detection, training and evaluating them on the self-constructed polar dataset (1342 JPG images covering four classes: sea ice, icebergs, ice channels, and ships).

5.1. Polar Target Data Testing Based on YOLOv5n, YOLOv8n, and YOLOv11n

The network architecture of YOLOv5n is shown in Figure 7a [25]. Its structure includes key modules such as Focus, CBL, CSP1_X, CSP2_X, SPP, and PAN: the Focus module processes the 608 × 608 × 3 input image to reduce spatial dimensions while retaining feature details; the CSP1_X and CSP2_X modules realize cross-stage feature fusion through residual connections to avoid gradient disappearance; the SPP module enhances large-target feature extraction; and the PAN structure fuses multi-scale features (76 × 76 × 255, 38 × 38 × 255, and 19 × 19 × 255) to improve small-target detection [25].
The training results of YOLOv5n are shown in Figure 8. It can be observed that during the training and validation processes, losses including box_loss, cls_loss, and obj_loss all exhibit a decreasing trend (e.g., train/box loss in Figure 8a drops from 0.12 to around 0.06; val/box_loss in Figure 8b decreases from 0.10 to around 0.05), while precision and recall increase gradually. This reflects that the model gradually tends to converge during training, and its detection performance for polar targets improves progressively. Meanwhile, the validation loss follows a consistent trend with the training loss, indicating that the model does not have obvious overfitting [25].
YOLOv8n is selected as the baseline model for subsequent optimization, as it inherits the advantages of previous YOLO versions and incorporates in-depth improvements: it uses the C2f module and efficient feature fusion to enhance detection accuracy and adapt to complex polar scenarios; the Anchor-Free detection head simplifies the structure, improves inference speed, and enhances small-target detection while reducing computational cost and memory usage; it optimizes bounding-box regression with CIoU loss and DFL loss, and uses binary cross-entropy for classification to further strengthen detection robustness [3].
The YOLOv8n network consists of three core parts: the backbone (extracts features layer by layer from low to high levels), neck (adopts the PAN concept to fuse multi-scale features and improve detection accuracy), and head (uses a decoupled design for accurate target localization and classification). The training results of YOLOv8n are shown in Figure 8: both training and validation losses (including box_loss, cls_loss, and dfl_loss) present a decreasing trend (e.g., train/box loss in Figure 8e drops from 2.0 to around 1.2), while precision and recall increase gradually. The validation loss has the same trend as the training loss, indicating that the model does not suffer from significant overfitting and has good adaptability to polar environmental targets [3].
To facilitate an intuitive understanding of the convergence speed, final performance, and overfitting status of each model, we now conduct an integrated comparison. YOLOv5n starts with the lowest train/box loss (0.12) and val/box loss (0.10), with its loss curves declining smoothly and steadily, achieving the lowest final train/box loss (≈0.06) and val/box loss (≈0.05), and metrics/mAP50 reaching around 0.8; its loss curves are the smoothest, with validation loss strictly following training loss, showing no obvious overfitting. YOLOv8n, as the baseline model, starts with a train/box loss of 2.0, converging to around 1.2, balances efficiency and accuracy with metrics/mAP50 around 0.6 and metrics/mAP50-95 around 0.4, and its validation loss has the same trend as training loss, with slight fluctuations but no significant overfitting. YOLOv11n also starts with a train/box loss of 2.0 but converges slightly faster to around 1.0, outperforms in metrics/mAP50 and metrics/mAP50-95, demonstrating the strongest feature capture ability, and its validation loss is consistent with the training loss trend, with no obvious overfitting. Through this comparison, readers can clearly discern the strengths and characteristics of each model in convergence speed, final detection performance, and generalization ability, offering a clear reference for selecting appropriate models for different polar target detection scenarios.
The network architecture of YOLOv11n is shown in Figure 7b. Its structure includes modules such as C2PSA, C3k, C3k2, PSABlock, and SPPF: the C2PSA and PSABlock modules enhance feature extraction and attention to polar targets; the C3k and C3k2 modules optimize cross-stage feature fusion; and the SPPF module improves the extraction of large-target features (e.g., icebergs) [3].
The training results of YOLOv11n are shown in Figure 8. During both training and validation phases, losses including box_loss, cls_loss, and dfl_loss show a decreasing trend (e.g., the train/box loss in Figure 8i decreases from 2.0 to around 1.0; the val/box_loss in Figure 8j drops from 3.5 to around 2.0), while precision and recall increase gradually. This reflects that the model gradually converges during training. Notably, its convergence speed is slightly faster than that of YOLOv5n, and the consistent trend of validation loss and training loss further demonstrates that the model has no significant overfitting issue, with stable detection performance for polar environmental targets [3].

5.2. Data Testing of Polar Target Datasets with Different YOLO Versions

Three sets of experiments are conducted on the polar target dataset using different YOLO versions. Training is carried out in the same environment to ensure comparable results. As shown in Table 4, the evaluation metrics include average precision (AP, measuring the detection capability of a single category), mean average precision (mAP, comprehensively evaluating all categories), and detection rate (FPS, reflecting real-time performance) [16,17,18].
The comparison of test results of polar target datasets for different YOLO versions is shown in Figure 9.
From the above results, it can be seen that the accuracy of YOLOv5n [25] is slightly lower than that of YOLOv8n [3], but the model is only 3.9 MB, making it suitable for deployment on resource-constrained devices; the mAP of YOLOv8n reaches 0.830, the highest of the three, and its detection speed (FPS) is also the fastest, with a single-image detection time of 6.8 milliseconds, achieving a good balance between accuracy and real-time performance; and YOLOv11n performs well in small-target detection (higher accuracy and recall), but its detection speed is slightly slower (9.2 milliseconds per image).
By comparing the performances of YOLOv5n, YOLOv8n, and YOLOv11n in polar environmental target detection, the following conclusions can be drawn.
YOLOv5n is suitable for deployment on resource-constrained devices due to its lightweight nature; YOLOv8n achieves a good balance between accuracy and real-time performance, making it suitable for most application scenarios; and YOLOv11n performs well in small-target detection and is very suitable for identifying various small targets in polar environments. The polar target dataset proposed in this paper can effectively support the target detection tasks of YOLO models in polar environments [3,25].

5.3. Cross-Scene Transfer Adaptability of Improved Modules and Polar-Specific Value

Polar environments present unique extended challenges: low-temperature sensor noise, drastic polar day–night illumination changes, dynamic sea ice evolution, and low-power edge deployment constraints. The selection of CBAM, SE, SCConv, AKConv, and BiFPN is based on precise alignment between module functions and polar characteristics, validated by cross-scene applications and relevant studies [14,15].
Polar low temperatures cause sensor noise, while extreme illumination distorts target features. CBAM/SE enhances effective feature channels and spatial regions to suppress noise and extract weak signals [14]. Proven in low-temperature SAR image processing and low-light detection, these mechanisms improve feature distinguishability between sea ice and noise, ensuring robust detection under harsh lighting.
Polar detection relies on edge devices with limited computing power and solar endurance. SCConv’s “channel spatial reconstruction” reduces parameters and computation while mitigating overfitting on small polar datasets. Similar lightweight convolutions have enabled real-time, low-power detection on Antarctic UAVs, matching this study’s small-sample and deployment needs.
Polar sea ice evolves dynamically (calving, aggregation, and thawing), posing challenges for fixed-kernel convolutions. AKConv’s dynamic sampling adapts to irregular, changing morphologies, as validated in dynamic target detection studies. It improves capture of iceberg edges and broken ice distribution, enhancing recall for dynamic polar targets.
Polar day–night illumination causes overexposure of large targets and attenuation of small ones. BiFPN’s bidirectional weighted fusion calibrates unbalanced scale features [15], as demonstrated in Antarctic ice channel monitoring. It stabilizes detection accuracy across extreme lighting, outperforming traditional FPN in multi-scale target alignment.
These modules address polar-specific challenges: noise suppression (CBAM/SE [14]), low-power adaptation (SCConv), dynamic morphology capture (AKConv), and illumination calibration (BiFPN [15]). Their cross-scene validation ensures engineering practicality, balancing accuracy, efficiency, and deployment feasibility for polar scientific research and navigation.

5.4. Comparison of Improvement Effects

This experiment is based on the PyTorch framework and a GPU (Windows 11, Intel Core i9-12900H, NVIDIA RTX 3050Ti, CUDA 11.8, and Python 3.8). The number of epochs is set to 300, the batch size to 8, the learning rate to 0.01, the confidence threshold to 0.5, and the non-maximum suppression threshold to 0.3. The model is trained from scratch; a single training run takes about 7 h, and 56 h are required in total. Figure 10 shows the distribution map of the polar target dataset, including the number of instances per category, the size of YOLO anchor boxes, and the distribution of target center coordinates and widths and heights, which provides insight into the input characteristics of the model.
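Under these settings, the training runs could be launched with the ultralytics API roughly as follows; the dataset YAML name is an illustrative assumption.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.yaml")              # build the network from scratch (no pretrained weights)
model.train(
    data="polar_dataset.yaml",            # 4 classes: sea ice, channel, iceberg, ship
    epochs=300,
    batch=8,
    lr0=0.01,                             # initial learning rate
    device=0,                             # single CUDA GPU
)
metrics = model.val()                     # reports precision, recall, mAP50, mAP50-95

# At inference, the thresholds from the text apply:
results = model.predict("polar_dataset/images/sample.jpg", conf=0.5, iou=0.3)
```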
Figure 11 compares the detection results of the improved YOLOv8n + SCConv and the original YOLOv8n [3].
It can be seen from Figure 11 that the original YOLOv8n model misses many detections of ice channels, icebergs, and sea ice, while the improved YOLOv8n + SCConv model has far fewer missed detections. The detection accuracy for ships in ice zones is high for both models, with the original model slightly better. The loss of the original model drops rapidly to 0.4 in the first 50 iterations, and its mAP rises to 0.83 after 200 iterations. In Experiment 5 (improved model), the loss also drops in the first 50 iterations and then decreases faster after 100 iterations; the mAP rises rapidly in the first 200 iterations, reaches 0.844, and then stabilizes. The loss of the improved model drops faster with less oscillation, and its final average precision is higher, demonstrating that the improvement is effective.
The overall training workflow and performance evolution are summarized in Figure 12. To further analyze the effectiveness of the model improvement, this paper shows and compares the training results of different modules in detail. The following is the performance of each module during training:
For baseline YOLOv8n [3] (Figure 13, including the precision–recall (PR) curve, precision–confidence curve, F1–confidence curve, and recall–confidence curve), the highlight is that the “ship” category achieves the highest mAP of 0.942 and leads in precision across all confidence intervals. However, its performance on small targets such as sea ice and ice channels is limited, with an F1-score of only 0.423. In terms of practical significance, it is suitable for basic ship navigation scenarios, for example, serving as a visual supplement for the Automatic Identification System (AIS) in polar waters [1]. It ensures continuous ship identification when AIS signals are interfered with, thereby reducing collision risks. Nevertheless, its weakness in detecting small targets restricts its application in complex ice regions, such as areas with dense sea ice fragments [21].
Figure 14 shows the training result diagram of YOLOv8n improved by SCConv.
Through the PR curve, it can be observed that the mAP of “ship” is the highest at 0.922. Compared with the original model [3], the SCConv-improved YOLOv8n achieves higher recognition accuracy for “sea ice”, “channel”, and “iceberg”. The P curve shows that the average precision is 0.937, and the F1-score reaches its maximum of 0.441 at a confidence threshold of 0.83, indicating that the improved model can effectively detect targets and that the improvement is effective.
As shown in Figure 15, the training results of YOLOv8n with the BiFPN network fused are presented.
The PR curve shows that the mAP of “ship” is the highest at 0.927. The model with the fused BiFPN network shows the most significant improvement in iceberg detection and is suitable for areas with frequent icebergs. The P curve shows that the average precision is 0.953, and the F1-score reaches its maximum of 0.312 at a confidence threshold of 0.80, indicating that YOLOv8n with the fused BiFPN network can effectively detect targets [15].
Experiments show that the polar target detection algorithm based on the improved YOLOv8n shows clear improvement over the baseline YOLOv8n. The SCConv convolution can extract small-target features in the polar target dataset, and fusing the BiFPN network can improve the detection effect. The improved YOLOv8n, combined with the polar target dataset, can complete the polar environmental target detection task and effectively detect polar environmental targets.

5.5. Error Analysis

TP (true positive), FP (false positive), TN (true negative), and FN (false negative) denote, respectively, a positive sample correctly predicted as positive, a negative sample incorrectly predicted as positive, a negative sample correctly predicted as negative, and a positive sample incorrectly predicted as negative [16]. As shown in Table 5, experiments show that the improved YOLOv8n + BiFPN [15] has the highest recall, while the improved YOLOv8n + SCConv has the highest mean average precision (comprehensively reflecting the performance across categories and the overall precision of the model) [17].
Table 6 illustrates that the improved YOLOv8n + SCConv shows a significant increase in mAP when detecting sea ice, icebergs, and ice channels, while its detection performance for ships is slightly lower than that of the original model. During ice zone navigation, sea ice and icebergs pose significant threats and are difficult to distinguish via radar. The YOLO algorithm can complement radar limitations. The improved model is adaptable to complex polar datasets, offering greater flexibility and adaptability for polar target detection.

6. Verification, Inference of Improved Model, and Visual GUI

6.1. Verification and Inference of Improved Model

The polar environment is special, and detection objects often have water surface reflections due to lighting and water characteristics [7]. To verify the generalization and robustness of the model, a dataset with water surface reflection features is selected for inference verification, simulating real scenarios and testing the model’s ability to handle complex interferences to ensure stable and accurate practical applications [6].
The reflection data verification results (Figure 16) show that the model can effectively avoid reflection interference and accurately recognize polar targets (such as icebergs), with strong robustness, suitable for polar target recognition and assisting navigators in navigation [1].
In maritime scenarios where UAVs take remote photos, model robustness is crucial. To verify the robustness of the proposed model, the improved YOLOv8n is compared with other YOLO versions (the recognition results are shown in Figure 17).
It can be seen from Figure 17 that YOLOv11n is prone to misdetection (misjudging sea ice as icebergs); all three models perform well in ship recognition, and the improved YOLO performs stably and in a balanced manner, effectively detecting and recognizing polar targets [3,25].
As shown in Table 7, Stable Diffusion is a deep learning-based generative model that can generate high-quality images from given text prompts. It combines a variational autoencoder (VAE) with a diffusion model, generating images by gradually adding and removing noise [26].
These prompts cover the main targets in the polar environment, such as icebergs, sea ice, and polar ships, as well as different scenarios and conditions. Figure 18 shows AI images generated using Stable Diffusion [26].
To verify the model’s robustness more comprehensively, these AI-generated images are used for testing. The improved model can successfully recognize targets in the generated polar images, as shown in Figure 18. However, there are still missed detections and misdetections; some small-sized sea ice is not recognized, leaving room for future improvement.

6.2. Classification Performance Comparison: Confusion Matrix Analysis

Figure 19 compares the confusion matrices, which serve as pivotal visualization tools to quantify a model’s classification performance, illustrating the rates of correct and incorrect classifications across categories (sea ice, ice channel, iceberg, ship, and background) [16]. To intuitively compare the classification details of YOLOv5n, YOLOv8n, and YOLOv11n, the three confusion matrices are presented and analyzed below.
Diagonal values (correct classification rates): For the core categories, YOLOv8n achieves a correct classification rate of 0.84 for channel, outperforming YOLOv5n (0.79) and YOLOv11n (0.83). In sea ice, YOLOv5n and YOLOv11n both reach 0.75, while YOLOv8n scores 0.73. For iceberg, YOLOv5n and YOLOv11n both hit 0.78, and YOLOv8n attains 0.80. All three models perform robustly in ship classification, with YOLOv5n at 0.95, YOLOv8n at 0.94, and YOLOv11n at 0.95.
Background misclassification: The misclassification of the background category into target classes is relatively low across all models. Notably, YOLOv8n exhibits the lowest misclassification rate in channel (0.16) compared with YOLOv5n (0.21) and YOLOv11n (0.17), indicating stronger robustness against non-target interference.
In summary, the confusion matrix analysis reveals that YOLOv8n demonstrates superior classification accuracy for channel, while YOLOv5n and YOLOv11n maintain comparable performance in sea ice, iceberg, and ship. These insights explain the variations in overall mAP metrics and highlight the models’ strengths in recognizing specific categories.

6.3. Visual GUI

A graphical user interface (GUI) realizes human–computer interaction through graphical elements such as windows and icons. Users do not need to remember command lines, which provides a convenient and intuitive operation method, improves efficiency, and reduces learning costs. The GUI is developed and run on a computer with the following configuration: a Windows 11 operating system, an Intel Core i9-12900H CPU, an NVIDIA RTX 3050Ti GPU, CUDA 11.8, and Python 3.8 as the development language. To ensure basic operation for users with different hardware, the minimum hardware requirements are defined as follows: CPU, Intel Core i5-10400F/AMD Ryzen 5 3500X or above (quad-core or more; base frequency ≥ 2.5 GHz); memory, ≥ 8 GB DDR4; GPU, NVIDIA GeForce GTX 1050Ti or above (video memory ≥ 4 GB; compute capability ≥ 6.0 to support CUDA acceleration); storage, ≥ 256 GB SSD (to ensure fast reading of image and video files); and operating system, Windows 10 64-bit or later with CUDA 10.2+ and Python 3.6+ environments.
During polar navigation, navigators’ visual observation of sea ice is prone to fatigue, leading to safety hazards [1]. To facilitate ice condition monitoring, this visual GUI system is introduced, which can detect polar sea ice in images and videos. During actual navigation, seafarers use drones to photograph ice conditions and upload the footage; the GUI identifies and analyzes the ice conditions near the route, reducing the navigators’ workload. The design follows principles of simplicity and intuitiveness, helping users quickly understand operations, reducing human risks and observation errors, providing tools for polar exploration and monitoring, ensuring navigation safety, and improving crew adaptability [3].
At the top of the interface, there are “Image Detection” and “Video Detection” navigation buttons to switch modes; the middle part displays function title bars according to the mode. In image mode, the center is an upload area decorated with icons, with “Upload Image” and “Start Detection” buttons below, and target categories and confidence levels displayed on the right. In video mode, the center is an upload area, supporting real-time monitoring and file detection, with three operation buttons below.
Functional process: The user clicks “Upload Image” to select a local file, which is automatically adapted in size and displayed on the left. Clicking “Start Detection” calls the model for inference in the background (aided by the GPU and CUDA acceleration; even under the minimum configuration, the inference time of a single 1080P image is controlled within 500 ms). In video mode, target recognition is carried out through a camera or file; the video stream is processed frame by frame, and results are displayed in a timely manner. Clicking “Stop Detection” releases resources and resets the interface, and a prompt is given if no upload is made.
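A minimal sketch of the image-detection path described above is given below, using Tkinter and the ultralytics inference API; the widget layout, window title, and weights file name are illustrative assumptions, not the system's actual implementation.

```python
import tkinter as tk
from tkinter import filedialog
from PIL import Image, ImageTk
from ultralytics import YOLO

model = YOLO("best.pt")                    # improved YOLOv8n weights (illustrative name)
root = tk.Tk()
root.title("Polar Target Recognition")
display = tk.Label(root)
display.pack()
state = {"path": None}

def show(img: Image.Image) -> None:
    img.thumbnail((800, 600))              # adapt size to the display area
    photo = ImageTk.PhotoImage(img)
    display.configure(image=photo)
    display.image = photo                  # keep a reference so it is not garbage-collected

def upload_image() -> None:
    state["path"] = filedialog.askopenfilename(filetypes=[("Images", "*.jpg *.png")])
    if state["path"]:
        show(Image.open(state["path"]))

def start_detection() -> None:
    if not state["path"]:
        return                             # the full system prompts the user here
    result = model.predict(state["path"], conf=0.5, iou=0.3)[0]
    show(Image.fromarray(result.plot()[:, :, ::-1].copy()))  # plot() returns BGR; flip to RGB

tk.Button(root, text="Upload Image", command=upload_image).pack(side=tk.LEFT)
tk.Button(root, text="Start Detection", command=start_detection).pack(side=tk.LEFT)
root.mainloop()
```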
Through this design, the GUI has image and video recognition, real-time feedback, and error-handling functions, ensuring task completion. The diagrams include various state interfaces of image and video detection, presenting the system process and effects.

7. Limitations and Conclusions

7.1. Limitations

Although this work has achieved certain progress, several limitations remain. The polar environmental target dataset is mainly collected from web images and frames extracted from polar videos. The data are manually filtered, carefully annotated, and augmented to improve quality and diversity, but the scene types, weather conditions, and viewing angles are still limited. Images from more extreme weather, complex illumination, and different platforms or sensors are not fully included, which restricts the generalization of the model in more complex polar environments.
The model design and comparison focus on YOLOv5n, YOLOv8n, and YOLOv11n. The detection performance of different improvement strategies is analyzed, but the study does not include comparisons with a wider range of object detection algorithms. The trade-off among detection accuracy, robustness, and computational cost is not fully examined, and further experiments based on richer model families and parameter settings are needed.
Finally, the model evaluation is mainly conducted through offline tests on the self-built dataset and several typical scenarios. The GUI-based system can perform image and video detection, but it has not yet been validated through long-term deployment on real shipborne platforms or in polar scientific missions. Its real-time performance, stability, and collaboration with other navigation devices in complex engineering environments still require further verification in future work.

7.2. Conclusions

This research tackles polar target detection challenges using deep learning. A dataset with 1342 JPG images (covering icebergs, sea ice, etc.) was constructed, and Grounding DINO pre-annotation, followed by LabelImg-based manual refinement, ensured annotation quality for model training [23,24].
Focusing on YOLO series algorithms, we improved YOLOv8n by integrating the CBAM/SE attention mechanism (to emphasize key features), SCConv/AKConv convolutions (to adapt to complex shapes and multi-scale targets), and BiFPN (for efficient feature fusion) [14,15]. Trained via the PyTorch framework and GPU resources, the improved model reduces missed detections and boosts precision, with metrics like mAP confirming its suitability for polar tasks [16,17]. A GUI for polar target recognition (supporting image upload and detection) was also designed for user-friendliness.
In summary, the improved YOLOv8n-based intelligent recognition system lays a technical foundation for polar monitoring and research, enhancing the safety and efficiency of ice zone navigation and advancing China’s polar shipping and expeditions [1,10].
However, polar target detection has room for improvement: (1) Integrate multi-source sensors (infrared thermal imaging and radar) to build a multi-modal system (e.g., infrared–visible light fusion) for robustness against polar lighting [4,12], and optimize inference efficiency via model lightweighting techniques (pruning and quantization) and GPU acceleration. (2) Enhance small-target detection with high-resolution feature maps, super-resolution reconstruction, and data augmentation (e.g., random scaling) [5,8]. (3) Combine point cloud processing and 3D CNNs to enable 3D reconstruction (e.g., sea ice thickness) beyond 2D image limits. (4) Explore online/incremental learning to adapt to the dynamic polar environment.
Additionally, polar detection achievements can expand to domains like ecological protection (wildlife distribution monitoring), resource development (mineral/fishery management), and tourism (intelligent navigation) [2], supporting polar undertakings across science, industry, and society.

Author Contributions

Conceptualization, J.J. and Z.W.; methodology, K.S.; software, K.S.; validation, K.S.; formal analysis, J.G. and R.G.; investigation, Z.W.; resources, Z.W.; data curation, J.J., K.S. and Z.W.; writing—original draft preparation, J.G.; writing—review and editing, R.G.; visualization, K.S.; supervision, J.J.; project administration, Z.W.; funding acquisition, J.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant no. 42261144671, the National Key R&D Program of China under grant no. 2024YFE0103200, the Innovation Group Project of Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai) under grant no. 311024001.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Acknowledgments

Thank you to the reviewers for their helpful comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, X.; Zhao, J.; Zhao, Y. Risk assessment of ice-class-based navigation in the Arctic: A case study in the Vilkitsky Strait. J. Phys. Conf. Ser. 2024, 2718, 012040. [Google Scholar] [CrossRef]
  2. Liu, M.; Yan, R.; Zhang, X. Sea ice recognition for CFOSAT SWIM at multiple small incidence angles in the Arctic. Front. Mar. Sci. 2022, 9, 986228. [Google Scholar] [CrossRef]
  3. Terven, J.; Cordova-Esparza, D.M.; Romero-Gonzalez, J.A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  4. Zhang, D.; Fan, H.; Kang, B.; Gao, J.; Li, T. Glacier remote sensing image detection method in complex background based on improved U-Net network. J. Basic Sci. Eng. 2022, 30, 806–818. [Google Scholar] [CrossRef]
  5. Fraser, A.D.; Massom, R.A.; Michael, K.J. A Method for Compositing Polar MODIS Satellite Images to Remove Cloud Cover for Landfast Sea-Ice Detection. IEEE Trans. Geosci. Remote Sens. 2009, 47, 3272–3282. [Google Scholar] [CrossRef]
  6. Gupta, M.; Barber, D.G.; Scharien, R.K.; Isleifson, D. Detection and classification of surface roughness in an Arctic marginal sea ice zone. Hydrol. Process. 2014, 28, 599–609. [Google Scholar] [CrossRef]
  7. Hong, S. Detection of small-scale roughness and refractive index of sea ice in passive satellite microwave remote sensing. Remote Sens. Environ. 2010, 114, 1136–1140. [Google Scholar] [CrossRef]
  8. Han, Y.; Li, J.; Zhang, Y.; Hong, Z.; Wang, J. Sea ice detection based on an improved similarity measurement method using hyperspectral data. Sensors 2017, 17, 1124. [Google Scholar] [CrossRef] [PubMed]
  9. Zhou, H.; Zhou, Z.; Li, X.; Wang, Z.; Liu, Y. Algorithm to Detect the Ship Wake from ERS-1 SAR Ocean Imagery. J. Remote Sens. 2000, 4, 55–60. [Google Scholar]
  10. Zheng, M.; Li, X.; Ren, Y. Research on automatic detection method of polar sea ice using Gaofen-3 spaceborne synthetic aperture radar. Acta Oceanol. Sin. 2018, 40, 113–124. [Google Scholar]
  11. Zheng, M. Research on C-band spaceborne synthetic aperture radar polar sea ice detection algorithm. Master’s Thesis, University of Chinese Academy of Sciences (Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences), Beijing, China, 2018. [Google Scholar]
  12. Zhao, C.; Xu, R.; Zhao, K. Research on polar sea ice detection method based on HY-2A/SCAT data. Period. Ocean Univ. China (Nat. Sci. Ed.) 2019, 49, 140–149. [Google Scholar] [CrossRef]
  13. Xu, C. Research on Polar Sea Ice Detection Methods Using Multi-Source Satellite Scatterometers. Master’s Thesis, Nanjing University of Information Science & Technology, Nanjing, China, 2023. [Google Scholar]
  14. Pang, B. Classification of images using EfficientNet CNN model with convolutional block attention module (CBAM) and spatial group-wise enhance module (SGE). In Proceedings of the International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2022), Guilin, China, 25–28 February 2022; SPIE: Bellingham, WA, USA, 2022; Volume 12247, pp. 34–41. [Google Scholar]
  15. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
  16. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  17. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  18. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  19. Wu, C.D.; He, X.; Wu, Y. An object detection method for catenary component images based on improved Faster R-CNN. Meas. Sci. Technol. 2024, 35, 085406. [Google Scholar] [CrossRef]
  20. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar] [CrossRef]
  21. Wei, F.; Ke, C. Glacier recognition in the Yigong Zangbo Basin based on optical and thermal infrared remote sensing images. China Sci. 2016, 11, 6. [Google Scholar]
  22. Jian, J.; Zhang, Y.; Xu, K.; Webster, P.J. Automatic Reading and Reporting Weather Information from Surface Fax Charts for Ships Sailing in Actual Northern Pacific and Atlantic Oceans. J. Mar. Sci. Eng. 2024, 12, 2096. [Google Scholar] [CrossRef]
  23. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2025. [Google Scholar] [CrossRef]
  24. Wu, H.; Meng, Z.; Wang, J. A Rapid Annotation Method for Six-Axle Tarpaulin Trucks Based on Labelimg and Easydata. CN Patent CN117911809A, 19 April 2024. [Google Scholar]
  25. Qiu, T.; Wang, L.; Wang, P.; Bai, Y. Research on object detection algorithm based on improved YOLOv5. Comput. Eng. Appl. 2022, 58, 63–73. [Google Scholar]
  26. Hennicke, L.; Adriano, C.M.; Giese, H.; Koehler, J.M.; Schott, L. Mind the Gap Between Synthetic and Real: Utilizing Transfer Learning to Probe the Boundaries of Stable Diffusion Generated Data. arXiv 2024, arXiv:2405.03243. [Google Scholar] [CrossRef]
Figure 1. Four types of polar targets: (a) sea ice image, (b) iceberg image, (c) ice channel, and (d) icebreaker “XUE LONG 2”.
Figure 2. The structure of the convolutional block attention module (CBAM), where “*” denotes the multiplication operation.
Figure 3. AKConv convolution diagram.
Figure 4. BiFPN feature fusion network.
Figure 5. Dataset collection examples for polar environmental target detection: (a,b) images collected from https://m.51miz.com/so-tupian/7413673.html (accessed on 13 October 2025); (c,d) frames extracted from polar videos.
Figure 6. Principle diagram of Grounding DINO.
Figure 7. Network architecture comparison of (a) YOLOv5n and (b) YOLOv11n.
Figure 8. Training result diagrams of YOLOv5n (a–d), YOLOv8n (e–h), and YOLOv11n (i–l), where blue lines indicate raw values per iteration and orange dots show smoothed trends to highlight convergence.
Figure 9. Comparison of test results for polar targets using different versions of YOLO. From left to right: YOLOv5n, YOLOv8n, and YOLOv11n; from top to bottom: icebergs, ships and ice channels, sea ice, and ships.
Figure 10. Polar targets data distribution diagram, where different colors indicate different target types.
Figure 11. Comparative visualization of YOLOv8n + SCConv (left) and YOLOv8n (right); from top to bottom: ships and ice channel, ships and sea ice, sea ice, icebergs, and sea ice.
Figure 12. Training process visualization.
Figure 13. YOLOv8n training results diagram.
Figure 14. Training results of YOLOv8n with SCConv.
Figure 15. YOLOv8n training results with BiFPN network integration.
Figure 16. Image recognition results interface.
Figure 17. Image recognition result interfaces; from top to bottom: ice channel and ship, sea ice, and icebergs; from left to right: YOLOv5n, YOLOv8n, and YOLOv11n.
Figure 18. Polar environmental target recognition on images generated by Stable Diffusion.
Figure 19. Confusion matrix comparison of YOLOv5n, YOLOv8n, and YOLOv11n for classification performance.
Table 1. Comparison of YOLOv8n with Faster R-CNN, RetinaNet, and SSD.

Index | YOLOv8n | Faster R-CNN | RetinaNet | SSD
Detection speed | Faster | Slower | Slower | Slightly slower
Detection accuracy | Slightly lower | Higher | Higher | Higher
Memory occupation | Smaller | Larger | Larger | Larger
Table 2. Distribution of images by target labels in the polar targets dataset.

Label Name | Iceberg | Sea Ice | Ice Channel | Ship | Total
Number of images | 354 | 588 | 261 | 439 | 1342
Images collected from websites | 263 | 438 | 194 | 327 | 998
Frames extracted from polar videos | 91 | 150 | 67 | 112 | 344
Table 3. Pseudocode for the data pre-annotation strategy based on Grounding DINO.

Pre-Annotation Pseudocode

Load Grounding DINO model with pretrained weights
image_dataset = LoadPEGData()
text_prompt = "iceberg, sea ice, channel, ship"
text_features = TextEncoder(text_prompt)  # the prompt is constant, so encode it once
for each image in image_dataset:
    image_features = VisualEncoder(image)
    fused_features = FeatureEnhancer(image_features, text_features)
    queries = LanguageGuidedQuerySelector(fused_features, text_features)
    bboxes = CrossModalityDecoder(queries)
    annotated_image = DrawBBoxes(image, bboxes, text_prompt)
    SaveAnnotatedData(annotated_image)
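A minimal runnable counterpart of this pseudocode, assuming the open-source Grounding DINO inference utilities from the IDEA-Research release [23]; the config, checkpoint, and image paths are illustrative placeholders:

import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate

# Load the Swin-T Grounding DINO model (config file + pretrained weights):
model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py",
                   "groundingdino_swint_ogc.pth")
TEXT_PROMPT = "iceberg . sea ice . channel . ship"  # classes separated by " . "

image_source, image = load_image("polar_scene.jpg")
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=0.35,   # keep boxes scoring above this
    text_threshold=0.25,  # keep phrase matches above this
)
annotated = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("polar_scene_preannotated.jpg", annotated)

The predicted boxes can then be exported as YOLO-format label files for subsequent manual fine-tuning in LabelImg.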
Table 4. Comparison of different versions of YOLO.

Experiment | Detection Model | mAP | P | R | F1 | Single-Image Detection Time (ms) | Weight (MB)
1 | YOLOv5n | 0.825 | 0.951 | 0.870 | 0.810 | 13.3 | 3.9
2 | YOLOv8n | 0.830 | 0.957 | 0.880 | 0.830 | 6.8 | 6.2
3 | YOLOv11n | 0.815 | 0.969 | 0.870 | 0.810 | 9.2 | 5.4
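A minimal sketch of how such validation metrics can be collected, assuming the Ultralytics API; the dataset YAML and the trained-weight filenames are illustrative placeholders:

from ultralytics import YOLO

# Validate each candidate model on the same dataset split:
for weights in ("yolov5n_best.pt", "yolov8n_best.pt", "yolov11n_best.pt"):
    model = YOLO(weights)
    metrics = model.val(data="polar.yaml", imgsz=640)
    print(weights, "mAP@50:", round(metrics.box.map50, 3))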
Table 5. Comparison of training results for the improved YOLOv8n (highest value bolded).

Experiment | Detection Model | mAP | P | R | F1 | Single-Image Detection Time (ms) | Weight (MB)
1 | YOLOv8n | 0.830 | 0.957 | 0.88 | 0.83 | 6.8 | 6.2
2 | YOLOv8n + SE | 0.810 | 0.933 | 0.88 | 0.79 | 7.0 | 12.2
3 | YOLOv8n + CBAM | 0.803 | 0.889 | 0.88 | 0.79 | 8.0 | 12.2
4 | YOLOv8n + BiFPN | 0.829 | 0.953 | 0.91 | 0.80 | 8.1 | 5.9
5 | YOLOv8n + SCConv | 0.844 | 0.937 | 0.91 | 0.83 | 6.8 | 6.0
6 | YOLOv8n + AKConv | 0.814 | 0.928 | 0.89 | 0.78 | 9.2 | 6.2
Table 6. Comparison of per-class AP@50 training results on the polar targets dataset (highest value bolded).

Detection Model | Sea Ice | Channel | Iceberg | Ship | mAP@50
YOLOv8n | 0.736 | 0.843 | 0.802 | 0.942 | 0.830
YOLOv8n + SE | 0.738 | 0.785 | 0.780 | 0.937 | 0.810
YOLOv8n + CBAM | 0.746 | 0.747 | 0.789 | 0.928 | 0.803
YOLOv8n + BiFPN | 0.746 | 0.783 | 0.861 | 0.927 | 0.829
YOLOv8n + SCConv | 0.751 | 0.855 | 0.849 | 0.922 | 0.844
YOLOv8n + AKConv | 0.729 | 0.719 | 0.870 | 0.939 | 0.814
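As a consistency check, the mAP@50 column equals the arithmetic mean of the four per-class AP@50 values; for example, for YOLOv8n + SCConv:

mAP@50 = (0.751 + 0.855 + 0.849 + 0.922) / 4 = 3.377 / 4 ≈ 0.844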
Table 7. Image generation process of Stable Diffusion based on predefined text prompts.
Image Generation via Text Prompts
“A large iceberg floating in the Arctic Ocean with clear sky.”
“Sea ice covering the Antarctic coastline with penguins in the background.”
“A polar bear walking on thin sea ice in the Arctic.”
“A ship navigating through dense sea ice in the Arctic.”
“A group of icebergs in the Arctic with aurora borealis in the sky.”
“A close-up view of a melting iceberg in the Arctic.”
“A ship stuck in thick sea ice in the Antarctic.”
“A wide shot of the Arctic landscape with multiple icebergs and sea ice.”
“A ship breaking through sea ice in the Arctic.”
“A detailed view of a ship’s hull interacting with sea ice.”
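A minimal generation sketch for these prompts, assuming the Hugging Face diffusers API; the model identifier and sampling settings are illustrative and may differ from the configuration used by the authors:

import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint in half precision on the GPU:
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "A large iceberg floating in the Arctic Ocean with clear sky.",
    "A ship navigating through dense sea ice in the Arctic.",
]
for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]  # one image per prompt
    image.save(f"sd_polar_{i:02d}.png")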