1. Introduction
Arctic sea ice has been shrinking since the 1980s, opening up new maritime opportunities such as the Northeast, Northwest, and Central Passages [
1,
2]. However, navigating these extreme environments is challenging due to dynamic sea ice, which poses a considerable threat to vessel safety [
3]. Therefore, accurate sea ice observation and reliable path planning are essential for safe navigation and are becoming major focuses of polar research.
Acquiring sea ice information predominantly relies on three methodologies: field measurements, ship-based observational surveys, and satellite remote sensing. In this context, Lu et al. [
4] developed a two-stream radiative transfer model for ponded sea ice, while Weissling et al. [
5] used video to capture ice dynamics. For large-scale assessments, Worby et al. [
6] analyzed over 20,000 samples from ship transects. Recently, computational approaches have significantly enhanced data processing, with researchers like Zhou et al. [
7] applying convolutional neural networks for instance segmentation of sea ice, and Ressel et al. [
8] using artificial neural networks for classification. Chen et al. [
9] pioneered a 3D reconstruction methodology for sea ice fields by integrating YOLO-based object detection. However, some of the aforementioned methods lack optimization for small-target sea ice, which may lead to difficulties in achieving complete identification of small sea ice floes in practical applications, and missed detection of sea ice could adversely affect obstacle avoidance decision-making for safe navigation. Although other approaches have proposed improvement strategies specifically for small-target sea ice, their model training in practice relies on large-scale datasets of ship-based sea ice imagery. Such datasets must be obtained through field observations by polar research vessels, making their acquisition significantly more challenging than that of open-source satellite remote sensing images.
The accurate extraction of sea ice feature parameters, particularly for image recognition, has become a crucial research focus. Current methodologies primarily use threshold segmentation, target recognition, and instance segmentation techniques [
10]. Among target recognition algorithms, YOLO, SSD, and Mask R-CNN are predominant [
11,
12]. For instance, Lu et al. [
13] introduced a fusion-based segmentation method for rocks, Zhang et al. [
14] refined YOLOv5 for occlusion challenges, and Wu et al. [
15] proposed SPE-YOLO using SE attention for small target detection. Other studies include Cai et al. [
16], who used CNNs for sea ice instance segmentation, and Dong et al. [
17], who developed a two-stage ice channel identification approach.
While these advancements in sea ice identification have improved situational awareness, they often do not directly provide a safe navigation plan. This has led to parallel research in polar path planning, where factors like sea ice concentration and thickness are considered. For instance, Shu et al. [
18] employed an optimal control-based method, integrated with macro-scale sea ice concentration and thickness grid data, to conduct path planning for ship fleets in the Northern Sea Route, while distinguishing between breakable and unbreakable ice to optimize navigation costs. Zhang et al. [
19] utilized a three-dimensional ant colony algorithm (3D-ACA), representing the ice field with average concentration and thickness of discrete grids, and took ship speed as a control variable to carry out multi-objective path planning for Arctic ships that balances fuel consumption and navigation risk optimization. Liu et al. [
20] constructed a polar path planning background using sea ice concentration data from NSIDC and sea ice thickness data from PIOMAS, and proposed the D*-NSGA-III dynamic multi-objective path planning algorithm to conduct research on ship path planning in the Arctic region. Lehtola et al. [
21] adopted an improved A*-based algorithm, combined with sea ice concentration, thickness data provided by the HELMI ice model and a ship-ice interaction model, to conduct safe and efficient route planning in ice-covered waters. Xu et al. [
22] used an improved D* Lite algorithm, modified local update and path extraction rules based on gridded sea ice concentration and thickness data, to conduct dynamic path planning for ships in Arctic waters.
However, a significant limitation of these existing approaches is their reliance on simplified or generalized ice models, often based on satellite data of sea ice concentration or thickness. These methods lack attention to the actual distribution positions of sea ice, which are crucial for fine-grained real-ship path planning. Our research addresses this gap by proposing a novel framework that, for the first time, utilizes a vision-based approach to construct a high-fidelity, realistic ice field model, enabling more precise and safer navigation planning. This study presents a comprehensive solution by integrating advanced sea ice detection with intelligent path planning to address the critical safety challenge posed by ship-ice collisions in polar navigation. Our approach leverages YOLOv5-ICE, an incremental YOLOv5 improvement that incorporates three key modules: Squeeze-and-Excitation attention mechanisms, improved spatial pyramid pooling, and Flexible ReLU activation. This model achieves superior performance in sea ice target identification, especially in terms of detecting small floes in high-concentration ice regions. The detected ice parameters then inform our Any-Angle Path Planning algorithm that optimizes routes based on multiple safety factors including path length, maneuver complexity, which is measured by turns, and ice collision risk. This integrated detection-planning framework represents a strong engineering contribution in polar navigation safety, providing ships with reliable, ice-adaptive routing solutions.
2. Sea Ice Image Detection Algorithm
2.1. Model of YOLOv5
YOLO is an end-to-end target detection algorithm based on deep learning, which has a faster detection speed and can meet real-time requirements when ships are sailing in polar regions [
23]. The model structure is simple and efficient, with strong scalability. The detection accuracy of small targets is relatively high, which is suitable for the identification of small sea ice targets in remote sensing images in complex polar environments. The network structure of YOLOv5 mainly contains Input, Backbone, Neck, and Head. The network structure is shown in
Figure 1. The Input layer takes images with a pixel size of 640 × 640 as input; the Backbone adopts the CSPDarknet53 structure and gradually extracts sea ice features from images through multiple sets of convolution, residual connection, and pooling operations; the Neck combines the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) to fuse features at different levels, which addresses the feature matching problem caused by scale differences among sea ice targets; the Head outputs the category probability, confidence, and bounding box coordinates of sea ice targets, realizing the localization and recognition of sea ice targets in remote sensing images.
The input side pre-processes the input image, including resizing and normalization, to ensure the consistency of the input data. The input images are scaled uniformly to the standard input size required by the network. To avoid distortion or information loss during scaling, the aspect ratio of the image is preserved: the scaling factor is computed from the image dimensions, the image is scaled by this factor, and the remaining blank area is filled with gray bars.
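The aspect-ratio-preserving resize described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `letterbox_params` and the choice to split the padding evenly between the two sides are assumptions, and only the 640 × 640 target size comes from the text.

```python
def letterbox_params(width, height, target=640):
    """Compute the scale and gray-bar padding for an aspect-preserving resize.

    Returns (scale, new_w, new_h, (left, top, right, bottom)).
    Hypothetical helper: padding is split evenly, which is an assumption.
    """
    # One scale factor for both axes, so the aspect ratio is unchanged.
    scale = min(target / width, target / height)
    new_w = round(width * scale)
    new_h = round(height * scale)
    # Whatever is left of the target square is filled with gray bars.
    pad_w = target - new_w
    pad_h = target - new_h
    left, top = pad_w // 2, pad_h // 2
    return scale, new_w, new_h, (left, top, pad_w - left, pad_h - top)


if __name__ == "__main__":
    print(letterbox_params(1280, 720))
```

For a 1280 × 720 image the longer side determines the scale, so the image is halved and bars are added only above and below.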
The backbone network contains the CBS structure, the C3 structure, and the SPPF (Spatial Pyramid Pooling - Fast) structure. Together, these modules form an efficient feature extraction network with low computational cost. The CBS consists of three components: Convolution, Batch Normalization, and the SiLU (Sigmoid-Weighted Linear Unit) activation function. The C3 structure consists of three standard convolution layers and a stack of Bottleneck modules, which allows features to be extracted and fused efficiently while maintaining low computational complexity. SPPF uses max pooling at different effective scales to enlarge the receptive field. The pooled feature maps are then combined to obtain a multi-scale feature representation.
The Neck network of YOLOv5 is located between the backbone network and the output and fuses feature maps of different scales to enhance the feature expression capability of the network. The Neck module is composed of a Feature Pyramid Network (FPN) and a Path Aggregation Network (PAN). The FPN structure transmits high-level semantic features from top to bottom, and the PAN structure transmits low-level spatial features from bottom to top, so that the feature map at each scale contains both semantic and spatial information about the target.
The output side is the last part of YOLOv5, which outputs the target category and corrects the position of the candidate box according to the predicted position offset to obtain more accurate detection results. On the output side, the architecture employs a convolutional layer that transforms the feature map into the final prediction results. This transformation directly maps each grid cell to predictions containing target bounding box positions, dimensions, and the corresponding class probability distribution. To enhance prediction accuracy, the system incorporates adaptively sized anchor boxes that serve as dimensional priors; the anchor dimensions are statistically derived from the distribution of ground truth boxes in the training dataset through a clustering optimization process. The output stage also covers the loss function and Non-Maximum Suppression (NMS), which work together to improve the stability and reliability of the network.
NMS is a commonly used technique in computer vision for handling overlapping bounding boxes in object detection. Its goal is to retain the best candidate box for each target by suppressing non-maximum candidates. Each candidate box carries a confidence score, and the Intersection over Union (IoU) is used to measure overlap: IoU is the ratio of the intersection area to the union area of two boxes. First, all candidate boxes B are sorted in descending order of score. The box with the highest score is selected and retained. For the remaining boxes, the IoU with the retained box is calculated. If the IoU exceeds a predefined threshold M, the box is discarded; otherwise, it is kept. The procedure is repeated on the boxes that remain until none are left, so that the boxes output by NMS overlap each other by at most M and only the highest-scoring box is retained for each target.
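The greedy procedure above can be sketched in a few lines. This is an illustrative version under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, each candidate has a confidence score used for ranking, and the function names `iou` and `nms` are chosen here for illustration rather than taken from the YOLOv5 code base.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    # Rank candidates from highest to lowest confidence score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    print(nms(boxes, [0.9, 0.8, 0.7]))
```

In the small example, the second box overlaps the first with an IoU of 81/119 and is suppressed, while the distant third box is kept.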
For the sliding window approach, adjacent image segments have a 15% overlap, as shown in
Figure 2. This overlap ensures that every region of the image is fully covered for detection. Although this approach introduces some redundant detections, they can be filtered out using NMS. Here, the NMS threshold is set to 0.5, so that a box whose IoU with a higher-scoring box exceeds this threshold is filtered out. After processing the entire image, the detection results from each cropped segment are merged to produce the final detection outcome, and NMS is used to select the best result from multiple prediction boxes and eliminate redundant detections [
24].
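One way to generate overlapping segments of this kind is sketched below. The 15% overlap comes from the text; the 416-pixel tile size, the rule that the last tile is placed flush with the image border, and the function names are assumptions made for illustration.

```python
def tile_origins(length, tile, overlap=0.15):
    """Start offsets of tiles of size `tile` along one axis of size `length`."""
    if length <= tile:
        return [0]  # a single tile; the image would need padding to fill it
    stride = max(1, int(tile * (1 - overlap)))
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # last tile sits flush with the border
    return origins


def sliding_windows(width, height, tile=416, overlap=0.15):
    """Return tile rectangles (x1, y1, x2, y2) covering the whole image."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]


if __name__ == "__main__":
    for window in sliding_windows(1000, 1000):
        print(window)
```

Detections from each tile would then be shifted by the tile origin back into full-image coordinates before the redundant boxes in the overlap zones are removed with NMS.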
2.2. The Optimization of the YOLOv5
The main difference between remote sensing satellite image recognition and general image recognition is that its image size is huge, while the target size in it is small and usually clustered together, resulting in recognition difficulties, as shown in
Figure 3 [
25]. The sea ice targets to be recognized are very small and difficult to recognize relative to the large size of the remote sensing image, and in areas of dense sea ice, multiple pieces of sea ice may also be close to each other or partially obscured.
To address the aforementioned challenges in remote sensing sea ice image recognition, we implemented three targeted modifications to the YOLOv5 algorithm to enhance its performance in detecting sea ice targets of varying sizes.
First, we integrated a Squeeze-and-Excitation Network (SE) attention mechanism into the backbone network. In complex remote sensing scenarios, different feature channels exhibit varying degrees of importance for sea ice detection. The SE mechanism explicitly models inter-channel dependencies, dynamically weighting each feature channel to selectively enhance ice-relevant characteristics (e.g., texture and boundary features) while suppressing background interference. This architectural modification enables the network to significantly improve its representational capacity and generalization performance.
Second, we enhanced the original Spatial Pyramid Pooling Fast (SPPF) module by developing the SPPCSPC-F structure. The substantial size variation in sea ice targets, ranging from small floes to extensive ice fields, presents significant multi-scale detection challenges. Our improved SPPCSPC-F architecture facilitates more effective fusion of multi-scale features, thereby strengthening the model’s capability to represent sea ice characteristics while maintaining detection accuracy across different target sizes.
Finally, we substituted the original SiLU activation function with a Flexible Rectified Linear Unit (FReLU). Given the importance of subtle pixel-level information in sea ice imagery, FReLU’s ability to preserve negative value information proves particularly advantageous compared to SiLU. This modification reduces information loss during feature extraction, enabling the model to learn more discriminative representations, which is a critical improvement for accurate ice detection in low-contrast or obscured regions.
2.2.1. Squeeze-And-Excitation Networks (SE) Attention Mechanism
Attention mechanisms mainly contain three types: spatial, channel, and hybrid domains. The SE model is a typical representative of the channel domain attention mechanism, focusing on the adaptive weight assignment of feature channels. In the working mechanism of this model, the feature map first undergoes compression in the spatial dimension, and then different weight values are applied to each feature channel, which characterizes the relative importance of the information carried by each channel [
26]. The initial feature map is recalibrated according to the obtained weights to realize the enhancement of critical feature channels and the suppression of non-critical channels.
Because the remote sensing image is large and the target sea ice is small, key information (such as the textures and edge contours of sea ice) is easily lost during identification, and small sea ice targets may be missed. For this reason, the SE attention mechanism is added in layer 9 of the YOLOv5 backbone network. The core of this module consists of the two steps of Squeeze and Excitation, which adaptively adjust the importance of each channel by learning the weights, so that the neural network can better capture the features of the sea ice image and improve its performance without adding too much computational burden. The structure of the SE Attention Mechanism is shown in
Figure 4.
In step 1, features are extracted from the image by convolution. Step 2 is the squeeze phase (Fsq), in which global average pooling is applied over the H and W dimensions of the feature map, yielding one value per channel. Step 3 is the excitation phase (Fex), in which a series of fully connected layers acts on the output of the squeeze phase and generates a channel attention vector. The final step is the rescaling operation (Fsc), in which each channel of the feature map is multiplied by its attention weight. Through the two steps Fsq and Fex, the SE attention mechanism adaptively learns the importance of each channel. When the weight of a channel increases, the values of the corresponding feature map increase and its influence on the output grows; conversely, when the weight decreases, the associated values decrease. This computational process improves the network’s ability to express and distinguish the features of the target regions and enhances the performance of the algorithm in detecting small targets.
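The squeeze, excitation, and rescaling steps can be illustrated with a small pure-Python sketch. It assumes the common SE layout of two fully connected layers with a ReLU between them and a sigmoid at the end; the weight matrices `w1` and `w2` are hypothetical placeholders (biases are omitted), not trained values from the paper.

```python
import math


def se_block(feature_map, w1, w2):
    """Apply squeeze-and-excitation to a feature map.

    feature_map: list of C channels, each an H x W list of lists.
    w1: (C/r) x C matrix, w2: C x (C/r) matrix (hypothetical weights).
    Returns (rescaled feature map, per-channel attention weights).
    """
    # Squeeze (Fsq): global average pooling over H and W, one value per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
         for ch in feature_map]
    # Excitation (Fex): FC -> ReLU -> FC -> sigmoid gives weights in (0, 1).
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]
    # Rescale (Fsc): multiply every value of a channel by that channel's weight.
    out = [[[v * s_c for v in row] for row in ch]
           for ch, s_c in zip(feature_map, s)]
    return out, s


if __name__ == "__main__":
    fmap = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
    _, weights = se_block(fmap, w1=[[1.0, 0.0]], w2=[[1.0], [-1.0]])
    print(weights)
```

With these placeholder weights the first channel is emphasized and the second is attenuated, which mirrors the enhancement and suppression behavior described above.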
2.2.2. SPPCSPC-F Spatial Pyramid Pooling
By introducing multi-scale feature pooling methods, YOLOv5 has significantly improved its ability to detect objects of different sizes. The SPPF is an improvement in the structure of SPP. The SPPF further optimizes computational efficiency by utilizing a single shared pooling layer for multi-scale aggregation, reducing computational overhead while maintaining feature richness. Compared to traditional methods, these improvements enhance the ability to detect both small and large objects with lower computational costs.
The convolutional block CBL in the internal structure of SPP consists of Convolution, Batch Normalization, and Leaky ReLU. The structure of SPPF is superior to that of SPP in that the max pooling layers are connected in series rather than in parallel, so that each layer reuses the output of the previous one, which speeds up the computation. The structure of SPPF is shown in
Figure 5. As shown, the convolutional block CBS consists of Convolution, Batch Normalization, and the SiLU activation function, and the structure contains three consecutive max pooling layers of size 5 × 5, which yield receptive fields at three different scales.
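The reason serial pooling is cheaper yet equivalent can be checked directly: with stride 1 and "same" padding (an assumption about the pooling configuration), two stacked 5 × 5 max pools give the same result as one 9 × 9 pool, and three give the same result as one 13 × 13 pool. The sketch below uses plain nested lists and hypothetical function names.

```python
def max_pool(x, k=5):
    """Stride-1 max pooling with 'same' padding on a 2D list of numbers."""
    h, w, r = len(x), len(x[0]), k // 2
    return [[max(x[i][j]
                 for i in range(max(0, row - r), min(h, row + r + 1))
                 for j in range(max(0, col - r), min(w, col + r + 1)))
             for col in range(w)]
            for row in range(h)]


def sppf_branches(x):
    """Return the input plus three serially pooled maps (5, 9, 13 windows)."""
    p1 = max_pool(x, 5)
    p2 = max_pool(p1, 5)  # equivalent to a single 9 x 9 pool of x
    p3 = max_pool(p2, 5)  # equivalent to a single 13 x 13 pool of x
    # In the real module these maps are concatenated along the channel axis.
    return [x, p1, p2, p3]


if __name__ == "__main__":
    grid = [[(7 * i + 13 * j) % 17 for j in range(15)] for i in range(15)]
    branches = sppf_branches(grid)
    print(branches[2] == max_pool(grid, 9), branches[3] == max_pool(grid, 13))
```

Each serial stage works on an already pooled map with a small window, which is why the three scales are obtained at lower cost than three independent large-window pools.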
Although the max pooling operation can expand the receptive field to obtain rich contextual information, this nonlinear downsampling reduces the spatial resolution of the feature maps and may lose some of the discriminative information in the original feature maps, making the detection of small targets less effective. This pooling structure can also lead to overfitting, which needs to be countered by using more training data or regularization.
In summary, the dual-branch architecture of this module effectively reconciles the competing demands of multiscale feature extraction and spatial detail preservation. The parallel processing pathways enable concurrent capture of both macroscopic ice field distribution patterns and microscopic floe characteristics. Through optimized pooling operations and feature fusion strategies, the module maintains computational efficiency while achieving these objectives [
27].
To address these limitations of SPPF, the advantages of the SPPCSPC structure from the YOLOv7 network model are integrated into SPPF to obtain the SPPCSPC-F module. Specifically, the SPP structure inside SPPCSPC is replaced with the SPPF structure, because SPPF offers better accuracy and speed. The resulting module is placed in layer 10 of the YOLOv5 backbone network, and its structure is shown in
Figure 6.
The SPPCSPC-F module first splits the input feature map into two branches by channel. One branch undergoes max pooling at three different scales to obtain multi-scale sea ice information. The other branch passes directly through a 1 × 1 convolution to maintain the original resolution. The features of the two branches are then concatenated along the channel dimension, so that multi-scale features are obtained while the original detail information is retained, and finally two convolutions further fuse the features. The SPPCSPC-F module modifies the order of the max pooling operations, which retains the details while keeping the receptive field unchanged and enhances the feature expression capability through stronger feature fusion. The introduction of the SPPCSPC-F module enhances the network’s capability to integrate multi-scale sea ice features while preserving crucial edge detail information. This module optimizes the feature extraction process, enabling more precise identification of diverse sea ice targets in satellite imagery, ranging from fragmented ice floes to continuous ice fields.
2.2.3. FReLU Activation Function
The original YOLOv5 algorithm uses the SiLU activation function. This activation function has some limitations: when the input is strongly negative, the derivative of SiLU tends to 0, which can lead to the vanishing gradient problem. Vanishing gradients make effective backpropagation difficult, which can hinder convergence or cause unstable training, reducing training efficiency and even leading to loss of information.
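The saturation behavior of SiLU can be checked numerically. The sketch below uses the standard definition SiLU(x) = x · σ(x) and its analytic derivative σ(x)(1 + x(1 − σ(x))); the function names are chosen here for illustration.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x * sigmoid(x)


def silu_grad(x):
    """Analytic derivative of SiLU."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))


if __name__ == "__main__":
    for value in (-10.0, -2.0, 0.0, 2.0, 10.0):
        print(value, round(silu(value), 5), round(silu_grad(value), 5))
```

The gradient is close to 0 for strongly negative inputs and approaches 1 for strongly positive inputs, so the vanishing-gradient concern applies to the negative side of the function.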
In this paper, we use the FReLU activation function, which is more suitable for the target recognition task, to replace the SiLU activation function. FReLU is a kind of funnel function, which is obtained by the improvement of the ReLU activation function, and extends the ReLU function by adding a spatial condition to expand the space to two dimensions, which is a relatively simple process to realize, and only adds a small computational overhead [
28]. The structure of the two activation functions is shown in
Figure 7.
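The spatial condition can be made concrete with a single-channel sketch. It assumes the funnel form y = max(x, T(x)), where T(x) is a 3 × 3 depthwise convolution with zero padding; batch normalization is omitted, the kernel values are hypothetical, and the authors' exact variant may differ.

```python
def frelu(x, kernel, bias=0.0):
    """Funnel-style activation on one channel: y = max(x, T(x)).

    x: H x W list of lists; kernel: 3 x 3 list of lists (hypothetical weights).
    T(x) is a 3 x 3 convolution with zero padding plus an optional bias.
    """
    h, w = len(x), len(x[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            t = bias
            # Spatial condition: weighted sum over the 3 x 3 neighborhood.
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        t += kernel[di + 1][dj + 1] * x[ii][jj]
            # Pixel-wise maximum between the input and its spatial condition.
            row.append(max(x[i][j], t))
        out.append(row)
    return out


if __name__ == "__main__":
    zeros = [[0.0] * 3 for _ in range(3)]
    print(frelu([[-1.0, 2.0], [3.0, -4.0]], zeros))
```

With an all-zero kernel the condition is 0 everywhere and the function reduces to ReLU; with learned kernels the threshold at each pixel depends on its neighborhood, which is what allows negative responses to survive where the local context supports them.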
This replacement of FReLU brings several specific benefits, including significantly faster training speed, enhanced ability to handle small objects or complex data, and reduced resource consumption, making it more suitable for deployment in resource-constrained environments. The FReLU activation function incorporates learnable parameters that allow the network to adaptively adjust the shape of the function through learning. This flexibility enhances the learning ability of the model and better adapts to the characteristics of sea ice images, and the advantages of the FReLU activation function in nonlinear transformation and feature enhancement can improve the performance of the target recognition model.
By combining the SE attention mechanism, the SPPCSPC-F spatial pyramid pooling module, and the FReLU activation function, this paper improves the performance of the model on the sea ice detection task. The model can learn more important feature representations, focus on key object regions, mitigate the overfitting problem, and improve its generalization ability when recognizing small sea ice in remote sensing images. The synergistic effect of the three modifications enables the model to obtain better recognition performance under limited data conditions.
2.2.4. Optimizing the Overall YOLOv5 Framework
The SE attention mechanism is added to the 9th layer of the backbone network, and the spatial pyramid pooling structure is improved in the 10th layer, while the SiLU activation function is replaced with the FReLU activation function, and correspondingly, the CBS layer of the network is changed to the CBF layer to optimize the structure. The YOLOv5 model is shown in
Figure 8.
The optimization of YOLOv5 increases the depth and size of the network, which adds some computational overhead, but as the depth of the network increases, the model can learn more complex and abstract representations of sea ice features, improving the accuracy of the algorithm.
Table 1 shows the parameters of the network before and after YOLOv5 optimization.
2.3. Experimental Results
2.3.1. Construction of the Data Set
This study utilizes multi-source satellite imagery to ensure robust sea ice detection under varying real-world conditions. Three primary datasets were selected to incorporate variations in spatial resolution, weather conditions, lighting environments, and geographical locations. The first dataset comprises 256 × 256 pixel RGB images from the NWPU dataset published by Northwestern Polytechnical University, with specific technical parameters detailed in reference [
29]. The second dataset includes Arctic sea ice imagery acquired from Google Earth (
http://earthengine.google.com/) with a spatial resolution of 0.5 m in the RGB spectral bands, formatted as 600 × 600 pixel images. The third dataset consists of multi-sensor remote sensing images provided by the Norwegian University of Science and Technology (NTNU), featuring data from three different satellite sensors across four polarization modes. Representative samples from these sea ice remote sensing datasets are illustrated in
Figure 9.
Due to the datasets’ origin from diverse geographical regions and spectral sources, they contain inherent differences in resolution and lighting conditions. To address these challenges and simultaneously increase the number of training samples, we employed several data augmentation methods. These included the Mosaic operation, perspective conversion, left-right flipping, and rotation, as illustrated in
Figure 10.
The Mosaic operation was a critical part of this process. Its main idea is to enrich the dataset by randomly cropping and scaling multiple images before stitching them together into a single, new image. This approach forces the model to learn and adapt to a wide variety of scales and contexts, improving its robustness and generalization capabilities in real-world scenarios.
The remote sensing sea ice dataset is constructed through deduplication, manual labeling, and auditing. The label name is “ice”, the number of images is 600, and the number of labels is 15,948, covering all ice within the dataset. This allows more diverse samples to participate in model training. The 15,948 annotations are created using the LabelImg v.1.8.6 tool, which helps ensure annotation precision.
The number of labels is large relative to the number of images because a large remote sensing image contains many densely distributed sea ice targets. Therefore, a sliding window was used to cut images of a specified size (such as 416 × 416 px) as the input. YOLOv5 ensures no data leakage during dataset splitting by setting random seeds, strictly dividing training, validation, and test data paths, checking for duplicate file paths, and using hash verification mechanisms. Additionally, it allows manual specification of data paths to strictly control data distribution, ensuring that the training, validation, and test sets do not overlap. In this way, the dataset is split into training, validation, and test sets in a 7:2:1 ratio, providing a broad feature set for the detection model and contributing to higher accuracy.
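A seeded, non-overlapping 7:2:1 split of this kind can be sketched as follows. The function name, the seed value, and the file names in the example are hypothetical; only the ratio and the 600-image count come from the text.

```python
import random


def split_dataset(paths, ratios=(7, 2, 1), seed=0):
    """Split file paths into disjoint train/val/test lists with a fixed seed."""
    unique = sorted(set(paths))  # drop duplicate paths before splitting
    rng = random.Random(seed)    # fixed seed makes the split reproducible
    rng.shuffle(unique)
    total = sum(ratios)
    n_train = len(unique) * ratios[0] // total
    n_val = len(unique) * ratios[1] // total
    train = unique[:n_train]
    val = unique[n_train:n_train + n_val]
    test = unique[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    images = [f"img_{i:04d}.png" for i in range(600)]  # hypothetical names
    train, val, test = split_dataset(images)
    print(len(train), len(val), len(test))
```

Because the three lists are slices of one shuffled, deduplicated list, no image can appear in more than one subset, and 600 images yield 420, 120, and 60 samples.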
2.3.2. Parameterization of the YOLOv5
The training parameters are set as follows: the number of training epochs is 300, the initial learning rate is 0.001, the momentum parameter is 0.9, the weight decay parameter is 0.0005, and the IoU threshold of non-maximum suppression is 0.5. Evaluation is carried out every 30 epochs of training. The F1 value, the harmonic mean of precision P and recall R, is used as a comprehensive index to evaluate the quality of the optimized model; a larger F1 value indicates a higher-quality model. The detailed hardware and software parameters are shown in
Table 2. Specific training parameters are given in
Table 3.
Average Precision (AP) and mean Average Precision (mAP) are generally used in the field of target recognition to evaluate the quality of algorithms. Precision (P) and Recall (R) are used to plot the Precision-Recall (PR) curve, and the mAP is calculated by integrating it; P quantifies the effectiveness of sample classification and R quantifies the ability to detect positive samples [
30].
The precision P is the proportion of samples predicted as positive that are actually positive. The recall R is the proportion of all actual positive samples that are correctly identified. The average precision AP is a more comprehensive measure of the model: with the recall R as the horizontal coordinate and the precision P as the vertical coordinate, the PR curve is plotted, and the area enclosed by the PR curve and the horizontal axis is the average precision AP. The mAP is generally calculated in two steps: the first step is to calculate the AP of each category in the dataset, and the second step is to average the APs over all categories. Overall, a good target recognition model should have both high precision P and high recall R, and consequently a high mAP value. The relevant equations are shown in Equations (1)–(5) [
31].
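Under the verbal definitions given in this section, the five metrics take the standard forms below. This is a sketch of the usual definitions; the ordering relative to the original equation numbers is an assumption, and N denotes the number of categories (a single “ice” category in this dataset).

```latex
\begin{align*}
P   &= \frac{TP}{TP + FP} \\
R   &= \frac{TP}{TP + FN} \\
F1  &= \frac{2 \, P \, R}{P + R} \\
AP  &= \int_{0}^{1} P(R) \, \mathrm{d}R \\
mAP &= \frac{1}{N} \sum_{i=1}^{N} AP_i
\end{align*}
```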
where the number of samples correctly categorized as positive is known as True Positives (TP), the number of samples incorrectly categorized as positive is known as False Positives (FP), the number of samples correctly categorized as negative is known as True Negatives (TN), and the number of samples incorrectly categorized as negative is known as False Negatives (FN).
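These quantities are straightforward to compute from the counts. The sketch below uses hypothetical function names and a simple trapezoidal area for AP, which is an assumption; detection toolkits typically use interpolated variants of the PR curve.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r, f1_score(p, r)


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def average_precision(recalls, precisions):
    """Trapezoidal area under a PR curve whose points are sorted by recall."""
    area = 0.0
    for k in range(1, len(recalls)):
        width = recalls[k] - recalls[k - 1]
        area += width * (precisions[k] + precisions[k - 1]) / 2
    return area


if __name__ == "__main__":
    # P and R values reported in this paper for the optimized and original models.
    print(round(f1_score(0.753, 0.703), 3), round(f1_score(0.719, 0.684), 3))
```

Applied to the precision and recall values reported in the ablation experiments, the harmonic mean reproduces the reported F1 values of 0.727 and 0.701.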
2.3.3. Ablation Experiments
To validate the model, ablation experiments were set up. The results of the ablation experiments are shown in
Table 4 and
Figure 10.
Based on the results of the ablation experiments, the original YOLOv5 has a mAP of 0.719, which is the lowest of all the evaluated models. Adding the SE attention mechanism improves the mAP by 1.9 percentage points to 0.738. In addition, as an alternative spatial pyramid pooling design for comparison, the Simplified SPPF (SimSPPF) improves the mAP by only 0.6 percentage points to 0.725. Although SimSPPF offers a small improvement over YOLOv5, it is not as good as the SPPCSPC-F selected in this paper. Adding SPPCSPC-F spatial pyramid pooling improves the mAP by 2.4 percentage points to 0.743. However, its R value is relatively low at 0.688. Replacing SiLU with the FReLU activation function improves the mAP by 2.8 percentage points to 0.747. When the three optimizations are used simultaneously, the mAP improves by 3.5 percentage points to 0.754, the best-performing configuration in the ablation experiments. Similarly, the P, R, and F1 values of the original YOLOv5 were 0.719, 0.684, and 0.701, respectively. After optimization, the P, R, and F1 values were 0.753, 0.703, and 0.727, which are improvements of 3.4, 1.9, and 2.6 percentage points, respectively.
As can be seen from
Figure 11, the yellow curve represents the mAP with only the SE module added, which has the lowest peak compared to the other three curves. The red curve represents the mAP of the fully optimized model, which performs best. The results show that the improved YOLOv5 has higher accuracy in recognizing remote sensing sea ice images.
2.3.4. Comparison Experiments
To further validate the effectiveness of optimizing YOLOv5, this paper sets up a comparison experiment. The optimized algorithm is compared with the original YOLOv5 and other target detection models such as Faster-RCNN, YOLOv3, YOLOv4, and the current newer YOLOv8 model, and the Loss value and mAP value of each model are calculated, and the comparison results are shown in
Figure 12.
From the trend of Loss value, it can be seen that in the first 40 epochs, the Loss value of each model decreases rapidly, which indicates that the network is learning the features of the sea ice rapidly and the training has not yet reached the stable stage. After 200 epochs, the training is gradually stabilized, in which the optimization model has a lower Loss value than the other algorithms, which indicates that the optimization of YOLOv5 has a fast convergence speed. All algorithms converge to stability at 250 epochs. In the stabilization phase, the optimized model has a lower Loss value and higher mAP value, which indicates better generalization ability and detection performance of YOLOv5-ICE compared to other models.
The trained models are evaluated individually, and the results are compared in Table 5. Compared with Faster-RCNN, YOLOv3, YOLOv4, YOLOv5, and YOLOv8, the mAP of the YOLOv5-ICE model is higher by 15%, 10.6%, 9.9%, 3.5%, and 1.3%, respectively. Among them, YOLOv3 has the lowest mAP of 0.604, and the optimized algorithm has the highest mAP of 0.754. YOLOv3, YOLOv4, and YOLOv8 have higher precision P but lower recall R; their F1 values of 0.552, 0.636, and 0.617 indicate that these three models miss some of the ice targets. YOLOv5-ICE has the highest F1 value of 0.727. The comparison experiments show that YOLOv5-ICE detects sea ice targets in remote sensing images better than the other models.
Large-size remote sensing sea ice images are recognized using YOLOv5-ICE. Because the sea ice targets are very dense, the confidence scores are hidden in the result images to show the recognition effect more clearly. The recognition results before and after optimization, together with locally zoomed-in areas, are shown in Figure 13. In the same areas, the original YOLOv5 recognizes 14 and 55 sea ice targets, whereas YOLOv5-ICE recognizes 53 and 88. The number of recognized targets thus increases by 39 and 33, respectively, and most of the additional targets are small and difficult to detect.
The recognition results provide environmental information for ship path planning: the detailed sea ice information serves as reliable input data for the path planning algorithm and helps generate an optimal path that better matches the actual ice conditions.
4. Path Planning Results and Discussion
4.1. Path Planning Under Low Sea Ice Concentration
Remote sensing sea ice images covering a fixed area of 10 km × 10 km are selected and input into the YOLOv5-ICE algorithm, and the recognition results are extracted. Each navigation scenario is divided into a 100 × 100 environmental grid with 10,000 cells in total, so that one cell corresponds to 0.1 km. To display the planned path more conveniently, 1 large grid in the figure contains 10 small grids.
K-means clustering is used to obtain sea conditions with different sea ice concentrations and thereby simulate different navigation scenarios for ships in the polar regions. The procedure takes a remote sensing image as input. First, the number of clusters k is determined, and k cluster centers are initialized. Next, the difference between the RGB values of each image pixel and the cluster centers is calculated and compared against a predefined threshold. If the difference exceeds the threshold, the process returns to the previous step and recalculates; if it is within the threshold, the proportion of each color in the k clusters is output, and the sea ice concentration is calculated from these proportions. A new remote sensing image can then be input to repeat the entire process.
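The K-means procedure described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the deterministic brightness-based initialization, the tolerance `tol`, and the rule that the brightest cluster is treated as sea ice are all assumptions made for the example.

```python
from math import dist


def assign(pixels, centers):
    """Label each pixel with the index of its nearest cluster center."""
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in pixels]


def kmeans_rgb(pixels, k, tol=1.0, max_iter=100):
    """Cluster RGB pixels; stop once no center moves by more than `tol`."""
    # Assumed initialization: evenly spaced picks from brightness-sorted pixels.
    ordered = sorted(pixels, key=sum)
    centers = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    for _ in range(max_iter):
        labels = assign(pixels, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(pixels, labels) if lab == j]
            new_centers.append(
                tuple(sum(ch) / len(members) for ch in zip(*members))
                if members else centers[j])
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:  # difference within the threshold: converged
            break
    return centers, assign(pixels, centers)


def sea_ice_concentration(pixels, k=2):
    """Fraction of pixels in the brightest cluster (assumed to be ice)."""
    centers, labels = kmeans_rgb(pixels, k)
    ice = max(range(k), key=lambda j: sum(centers[j]))
    return labels.count(ice) / len(pixels)


# Synthetic example: 70 dark "water" pixels and 30 bright "ice" pixels.
pixels = ([(10 + i % 5, 30, 80) for i in range(70)]
          + [(240, 240, 250 - i % 3) for i in range(30)])
concentration = sea_ice_concentration(pixels, k=2)
print(f"sea ice concentration: {concentration:.1%}")
```

With a real image, `pixels` would be the flattened list of RGB triples, and k would be chosen from the dataset as described in the text.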
After the k value is determined from the sea ice dataset, six navigation scenarios with different sea ice concentrations are obtained: 5.0%, 8.3%, 15.7%, 18.4%, 25.1%, and 40.9% [30]. Based on the recognition results of the remote sensing sea ice images, the corresponding raster maps are constructed, and the Theta* algorithm is used for path planning. With the first grid of the map set as the starting point and the last grid as the endpoint, the path planning results are shown in Figure 17.
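For readers unfamiliar with Theta*, the sketch below shows a basic cell-based variant on a binary raster map (0 = open water, 1 = sea ice). It is an illustration under stated assumptions rather than the implementation used in the paper: nodes are cell centers, neighbors are 8-connected, and the line-of-sight test is a simple Bresenham walk that permits diagonal moves between two blocked cells.

```python
import heapq
from math import hypot


def line_of_sight(grid, a, b):
    """Bresenham walk from cell a to cell b; False if a blocked cell is hit."""
    (r0, c0), (r1, c1) = a, b
    dr, dc = abs(r1 - r0), abs(c1 - c0)
    sr = 1 if r1 > r0 else -1
    sc = 1 if c1 > c0 else -1
    err = dr - dc
    r, c = r0, c0
    while True:
        if grid[r][c]:
            return False
        if (r, c) == (r1, c1):
            return True
        e2 = 2 * err
        if e2 > -dc:
            err -= dc
            r += sr
        if e2 < dr:
            err += dr
            c += sc


def theta_star(grid, start, goal):
    """Return a list of waypoints from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])

    def h(n):
        return hypot(n[0] - goal[0], n[1] - goal[1])

    g = {start: 0.0}
    parent = {start: start}
    open_heap = [(h(start), start)]
    closed = set()
    while open_heap:
        _, s = heapq.heappop(open_heap)
        if s == goal:
            path = [s]
            while parent[s] != s:
                s = parent[s]
                path.append(s)
            return path[::-1]
        if s in closed:
            continue
        closed.add(s)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                n = (s[0] + dr, s[1] + dc)
                if n == s or not (0 <= n[0] < rows and 0 <= n[1] < cols):
                    continue
                if grid[n[0]][n[1]] or n in closed:
                    continue
                # Any-angle step: link n to the parent of s when visible,
                # otherwise fall back to the ordinary grid edge s -> n.
                p = parent[s]
                base = p if line_of_sight(grid, p, n) else s
                cand = g[base] + hypot(n[0] - base[0], n[1] - base[1])
                if cand < g.get(n, float("inf")):
                    g[n] = cand
                    parent[n] = base
                    heapq.heappush(open_heap, (cand + h(n), n))
    return None


# Hypothetical 5 x 5 map; start is the first cell and goal is the last cell.
grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
path = theta_star(grid, (0, 0), (4, 4))
print(path)
```

Because each node may inherit the parent of its predecessor, the returned waypoints are not restricted to grid edges, which is what yields shorter paths with fewer turns than plain A* on the same map.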
The planned-path results for the navigation scenarios under different sea ice concentrations are listed in Table 6, which gives the path length, the number of ship turns, the number of sea ice avoidance maneuvers, and the sea ice risk value. The parameters are compared visually in Figure 18.
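Two of these indicators, path length and number of turns, can be derived directly from a list of waypoints. The sketch below is one plausible way to compute them and is not taken from the paper; it assumes waypoints are (row, column) grid coordinates, uses the 0.1 km cell size stated above, and counts a turn wherever three consecutive waypoints are not collinear.

```python
from math import hypot


def path_length_km(waypoints, cell_km=0.1):
    """Sum of straight segment lengths, converted from grid cells to km."""
    cells = sum(hypot(b[0] - a[0], b[1] - a[1])
                for a, b in zip(waypoints, waypoints[1:]))
    return cells * cell_km


def count_turns(waypoints):
    """Count heading changes; exact reversals (U-turns) are not detected."""
    turns = 0
    for a, b, c in zip(waypoints, waypoints[1:], waypoints[2:]):
        # A non-zero cross product means the three points are not collinear.
        cross = (b[0] - a[0]) * (c[1] - b[1]) - (b[1] - a[1]) * (c[0] - b[0])
        if cross != 0:
            turns += 1
    return turns


# Hypothetical waypoints: 3 cells east, then 4 cells south.
route = [(0, 0), (0, 3), (4, 3)]
print(round(path_length_km(route), 3), "km,", count_turns(route), "turn(s)")
```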
The results show that sea ice concentration significantly affects both the ship's path selection and navigation safety. As the sea ice concentration increases, the path becomes more tortuous, the number of turns and the number of sea ice avoidance maneuvers tend to increase, and the sea ice risk value rises. Numerically, when the sea ice concentration increases from 5.0% to 40.9%, the path length increases by 3.81 km, the number of turns made to avoid sea ice obstacles increases by 23, and the overall sea ice risk value increases by 11.6%.
Target recognition provides information about the distribution and morphology of sea ice, and path planning uses this information to avoid dense sea ice areas. By combining target recognition with path planning, that is, perception with decision-making, this paper finds shorter, smoother, and safer paths under complex polar sea conditions, providing technical support for the navigation of polar ships [35,36,37].
4.2. Path Planning Under High Sea Ice Concentration
Section 4.1 analyzes path planning up to the highest observed sea ice concentration of 40.9%. However, boundary inflation processing presents difficulties when applied to high-concentration sea ice detection results. While this method is effective for individual ice floes with uniform buffers, it is unsuitable for high-density scenarios in which the ice map consists of densely packed rectangular grids (1000 × 1000). Applying spatial expansion to these grids would classify nearly the entire map as obstructed space, represented by solid black areas, and thus prevent feasible path planning.
To address this challenge, we employ a binarization process to convert the high-density ice map into a grid-based representation. Traditional grid-based navigation methods categorize each cell as either blocked or unblocked, with black cells indicating sea ice and white cells representing open water. The high-density sea-ice maps before (10 km × 10 km) and after conversion are shown in
Figure 19.
In Figure 19, the image is first converted to the HSV color space to detect the red borders of the detection boxes. The contours of the red borders are identified to obtain the positions of the detection boxes. A 1000 × 1000 gridded environment map is then created. For each detection box, the pixel positions are converted into row and column indices of the grid map, and the grids within the box are marked as sea ice. In the code, only two values are defined: 0 (black) for water and 1 (white) for sea ice. Finally, the gridded sea ice distribution map is displayed according to these markings; it corresponds to the binarized black-and-white grid map on the right side of Figure 19.
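The box-to-grid conversion step can be sketched as below. The example assumes the detection boxes have already been extracted as pixel rectangles (the HSV thresholding and contour detection are omitted), and the function name, box format, and image size are hypothetical. The 0/1 coding follows the text: 0 for water and 1 for sea ice.

```python
def boxes_to_grid(boxes, image_size, grid_size=1000):
    """Mark every grid cell covered by a detection box as sea ice (1).

    boxes: iterable of (x_min, y_min, x_max, y_max) in pixels (assumed format).
    image_size: (width, height) of the source image in pixels.
    """
    width, height = image_size
    grid = [[0] * grid_size for _ in range(grid_size)]
    for x_min, y_min, x_max, y_max in boxes:
        # Convert pixel coordinates to column/row indices, clamped to the map.
        c0 = max(0, int(x_min * grid_size / width))
        c1 = min(grid_size - 1, int(x_max * grid_size / width))
        r0 = max(0, int(y_min * grid_size / height))
        r1 = min(grid_size - 1, int(y_max * grid_size / height))
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                grid[r][c] = 1
    return grid


# Small demonstration: a 2000 x 2000 px image mapped onto a 10 x 10 grid.
demo_boxes = [(0, 0, 399, 399), (1000, 1000, 1199, 1199)]
demo_grid = boxes_to_grid(demo_boxes, image_size=(2000, 2000), grid_size=10)
ice_cells = sum(map(sum, demo_grid))
print(ice_cells, "ice cells out of", 10 * 10)
```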
Basic Theta* may fail in areas of high ice density because it models ice as impassable regions. We therefore extend the basic Theta* algorithm to non-uniform grid maps (Non-uniform Theta*), representing map traversability with non-uniform cell costs, where a layer of risk cells with diminishing grayscale values is generated.
The risk function R(c) accounts for the spatial relationship to a threshold distance of 20 grids. For distances greater than or equal to 20 grids, the risk value remains constant at 1, indicating a stable and minimal level of risk. For distances within 20 grids, the risk increases nonlinearly as r decreases, so that risk grows more steeply closer to the origin. The addition of 1 ensures a minimum risk baseline across all distances. This formulation captures the escalation of risk in proximity to the critical threshold while stabilizing risk at greater distances.
The traversal cost function, edge, gives the cost of moving from the parent node to the child node. This formulation is particularly useful for path planning or graph traversal problems in which both distance and risk must be considered.
The risk function and traversal cost function used are shown in Formulas (9)–(11) [38].
where r represents the distance (in grids) to a reference location.
where c_p represents a parent node, c_child represents a child node, dist(c_p, c_child) denotes the distance between the parent node and the child node, and the remaining factor represents the average risk factor, which scales the distance to account for associated risks in the environment.
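To make the roles of these two functions concrete, the following sketch implements one form that is consistent with the verbal description. It should not be read as Formulas (9)–(11) themselves: the quadratic growth term inside the 20-grid threshold and the use of the mean of the parent and child risks as the average risk factor are assumptions chosen only to match the stated properties (constant risk of 1 beyond the threshold, nonlinear increase as r decreases, baseline of 1).

```python
from math import hypot

THRESHOLD = 20  # threshold distance in grids, as stated in the text


def risk(r: float) -> float:
    """Risk at distance r (in grids) from the reference location.

    ASSUMED form: 1 beyond the threshold, 1 plus a quadratic term inside it.
    """
    if r >= THRESHOLD:
        return 1.0
    return 1.0 + ((THRESHOLD - r) / THRESHOLD) ** 2


def edge_cost(parent, child, risk_of):
    """Distance between two nodes scaled by their average risk (assumed)."""
    d = hypot(child[0] - parent[0], child[1] - parent[1])
    return d * (risk_of(parent) + risk_of(child)) / 2


# Hypothetical scenario: a single reference location (ice center) at (0, 0).
def risk_of(cell):
    return risk(hypot(cell[0], cell[1]))


far_cost = edge_cost((0, 30), (0, 40), risk_of)   # both nodes beyond 20 grids
near_cost = edge_cost((0, 5), (0, 15), risk_of)   # both nodes inside 20 grids
print(far_cost, near_cost)
```

Under these assumptions, two segments of equal length cost more the closer they run to the ice, which is what lets a planner pass near small floes when necessary while still preferring open water.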
Using the functions above, Figure 17 can be converted into the raster plot with a risk index shown in Figure 20a. The path planning results are shown in Figure 20b,c.
Figure 20 shows how the image is converted into one with a risk index. The risk function is defined by comparing the distance from the center of each sea ice block to its edge with the radius of the sea ice. In Formula (9), a dynamic risk classification is applied to each sea ice block. The risk is rendered as a color gradient: the closer a cell is to the center of the sea ice, the darker it appears, and the farther it is from the center, the lighter it becomes. Because the risk index is dynamic, the risk distribution varies for sea ice blocks of different sizes.
Compared with the previous images, if the rectangular-box grid conversion method is applied to high-density sea ice images, the converted image is entirely white. With the raster transformation based on the risk function, by contrast, one can judge by inspection that passable routes exist between the sea ice. Based on Figure 20a, a polar path-planning algorithm is employed.
Figure 20b,c illustrate that the route planned by the non-uniform Theta* algorithm can pass through some small ice floes while avoiding larger ones. The red solid line represents the collision avoidance path obtained by the algorithm, and the green markers indicate the ship's turns. As the figure shows, the two paths, which start from different points, take 39 and 40 turns, respectively.
5. Conclusions
In this paper, remote sensing sea ice identification is performed using computer vision methods. Based on the identification results, polar ship route planning is conducted to assist in decision-making for navigation. The following conclusions can be drawn.
- (1)
Targeted optimization of YOLOv5 is carried out according to the characteristics of remote sensing images. This optimization includes three improvements: adding the SE attention mechanism, improving the spatial pyramid pooling structure, and replacing the activation function with FReLU, which is more suitable for the target identification task.
- (2)
Ablation experiments are conducted to compare the effects of the different improvement methods. When the optimization methods are applied simultaneously, the mAP improves by 3.5%. Comparison experiments against Faster-RCNN, YOLOv3, YOLOv4, YOLOv5, and YOLOv8 verify the effectiveness of the optimized algorithm. YOLOv5-ICE achieves an mAP of 75.4%, which is 1.3% higher than that of YOLOv8, making it the best-performing model. In the recognition examples, the number of detected sea ice instances increases by 39 and 33.
- (3)
Path planning for polar ships under low sea ice concentrations is based on a risk-assessed grid map. Images with high sea ice density are successfully converted by introducing the risk function, so that the result is no longer a pure white raster plot produced by rectangular-box conversion. Using the non-uniform Theta* algorithm, paths can be planned successfully; they avoid the otherwise impassable sea ice and take 39 and 40 turns, respectively.
Although the study has successfully detected sea ice in remote sensing images and performed path planning based on it, there are still some limitations. This study focuses on the innovative integration of an improved object detection algorithm with path planning. For ship movement, an idealized approach was adopted: in high ice concentration areas, the vessel is simplified into a point-mass model, and its own factors are considered idealistically in the path planning process. In future work, we will employ image segmentation techniques to accurately extract ice floe geometries, while incorporating real-world vessel maneuvering capabilities into the framework.