1. Introduction
Ocean fronts are boundary regions in the ocean where physical properties such as temperature, salinity, and density change sharply, often appearing as narrow bands with intensified temperature gradients. As significant mesoscale features in ocean dynamics, fronts exert profound impacts on material transport [
1], climate regulation, and the distribution of primary productivity, while also playing a crucial role in the stability of marine ecosystems. In the context of accelerating climate change, the evolution and structure of ocean fronts have increasingly influenced ocean circulation patterns, fishery distributions, and ecosystem health. Therefore, accurate detection of ocean fronts is vital for understanding climate system evolution, enhancing marine forecasting, and supporting evidence-based resource management.
However, ocean front detection remains challenging due to the complex and dynamic nature of the marine environment. Fronts often exhibit multiscale coexistence, fuzzy boundaries, and strong background noise [
2,
3], which severely limit the performance of traditional detection methods. Existing approaches mainly include gradient-based thresholding, edge detection, and physics-based models. While such methods can extract frontal features under idealized conditions, they rely heavily on fixed parameters and prior knowledge, making them less effective in dynamically evolving or noisy environments. For example, the global front detection algorithm based on SST and chlorophyll proposed by Belkin and O’Reilly [
2] performs well in open oceans but suffers from high false positives and inaccurate boundary localization in coastal regions.
In recent years, deep learning techniques have achieved remarkable progress in image recognition and semantic segmentation, providing new solutions for intelligent detection of complex marine structures. Compared with traditional algorithms, deep neural networks can learn hierarchical, nonlinear features from raw data via end-to-end training, significantly improving their ability to model unstructured patterns. U-Net and Mask R-CNN represent two prominent deep learning architectures in this field.
U-Net was originally proposed by Ronneberger et al. [
4] for biomedical image segmentation [
5,
6]. Its symmetric encoder–decoder architecture, combined with skip connections, enables high-fidelity reconstruction of spatial structures by reusing high-resolution encoder features during decoding; this improves thin-structure and edge recovery but may also retain high-frequency fluctuations when the input is noisy or low-contrast. The model has been widely adopted in remote sensing applications such as coastline extraction, building delineation, and vegetation classification. Mask R-CNN, introduced by He et al. [
7], extends Faster R-CNN by incorporating a mask prediction branch. It utilizes Region Proposal Networks (RPN) and Feature Pyramid Networks (FPN) to encode multi-scale spatial features, allowing simultaneous object detection and pixel-level segmentation. This design excels in challenging scenarios with occlusions, blurry boundaries, and dense targets.
Li et al. [
8] applied U-Net to detect ocean fronts in the East China Sea and Kuroshio Extension, demonstrating the model’s ability to restore frontal axis morphology and capture fine boundary features. Zhang et al. [
9] investigated the frontogenesis of North Pacific subtropical sea surface temperature fronts and analyzed their dynamical connection with the overlying atmosphere, providing important insights into the physical mechanisms governing ocean front variability. More recently, deep learning-based ocean-front studies have further emphasized efficiency and rapid extraction capabilities, such as the edge-detection-model-based front mapping proposed by Felt et al. [
10] and the lightweight SQNet designed for fast ocean-front identification by Niu et al. [
11].
Furthermore, U-Net has been applied to urban road extraction, while Mask R-CNN has shown effectiveness in forest and water body classification for natural resource monitoring.
Based on this context, this study conducts a comparative analysis of two deep learning models, U-Net and Mask R-CNN, for automated ocean front detection in the Northwestern Pacific. A curated remote sensing dataset primarily composed of satellite-derived sea surface temperature (SST) imagery is constructed to evaluate the adaptability and performance of both models under varying conditions of spatial scale, structural complexity, and background interference. The assessment encompasses a quantitative comparison of segmentation accuracy, spatial consistency, and boundary precision, along with an evaluation of computational efficiency and model deployment cost. Building on these comparative findings, we further discuss the feasibility of hybrid, coarse-to-refined strategies that combine the complementary strengths of U-Net and instance-aware segmentation frameworks to improve stability and robustness in real-world marine scenarios. The findings offer practical insights for the application of deep learning in ocean remote sensing and contribute to the development of intelligent front detection techniques.
2. Study Area and Data
2.1. Study Area and Data Source
This study focuses on the Northwestern Pacific region, covering latitudes 0–50° N and longitudes 100–150° E, as shown in Figure 1, which illustrates the geographic extent of the study domain based on SST reanalysis coverage; coastlines, marginal seas, and major oceanic circulation systems are indicated to clarify the environmental context of the analysis. The study area encompasses several key marginal and open-sea domains, including the Bohai Sea, Yellow Sea, East China Sea, South China Sea, and parts of the Western Pacific Ocean. It spans temperate, subtropical, and tropical climatic zones, characterized by the confluence of multiple water masses and current systems such as the Asian monsoon circulation and the Kuroshio Current and its extension. The region features a complex topographic setting with continental shelves, island arcs, deep ocean basins, trenches, and coral reef ecosystems, making it highly heterogeneous in hydrographic and thermal structure. Due to its dynamic oceanographic processes and frequent frontal activity, the region plays a critical role in modulating regional climate variability, biogeochemical cycling, and the evolution of marine ecosystems. Accordingly, it serves as an ideal testbed for evaluating the robustness and adaptability of deep learning models in ocean front detection.
The SST data used in this study were obtained from the GLORYS12V1 global ocean reanalysis product, distributed by the Copernicus Marine Environment Monitoring Service (CMEMS,
https://data.marine.copernicus.eu/product/GLOBAL_MULTIYEAR_PHY_001_030/services (accessed on 1 October 2025)). Developed by Mercator Ocean International, GLORYS12V1 is a fourth-generation global reanalysis system that assimilates satellite observations, in situ measurements, and numerical simulations. It is based on the NEMO ocean circulation model and the LIM3/SAM2 data assimilation framework. The product provides global, daily, three-dimensional ocean temperature fields from January 1993 to December 2020, with a horizontal resolution of 1/12° and 50 vertical layers, offering sufficient spatial and temporal granularity to resolve oceanic frontal structures.
Because the GLORYS12V1 dataset is a reanalysis product that assimilates satellite and in situ observations through data assimilation, it provides spatially continuous SST fields without cloud-induced gaps. Therefore, explicit cloud masking or gap-filling procedures were not required in this study.
The dataset is provided in NetCDF format and referenced to the WGS 84 coordinate system. For the purposes of this study, daily mean SST data covering the 100–150° E and 0–50° N domain were extracted. The high spatial resolution and temporal continuity of this dataset make it well-suited for training and validating deep learning models in a variety of oceanographic conditions, thereby ensuring the generalizability and credibility of the experimental results.
2.2. Data Preprocessing
To ensure the physical consistency, numerical stability, and semantic accuracy of model inputs, a systematic preprocessing pipeline was applied to the original remote sensing data. This process comprised four key stages: sample selection, numerical normalization, data augmentation, and high-quality mask annotation.
During sample screening, we computed the two-dimensional SST gradient field and selected scenes with prominent frontal signatures, followed by visual inspection to remove distorted images or scenes without clear frontal structures. The resulting dataset contains 512 SST images, covering multiple seasons and subregions within the Northwestern Pacific to ensure representative spatiotemporal variability.
To improve convergence stability during training and minimize the impact of inter-sample scale variation on model learning, Min-Max normalization was applied independently to each image, linearly mapping pixel values to the [0, 1] interval. This ensured consistency in the input distribution and enhanced the physical interpretability of the training data.
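As a minimal illustration of the screening and normalization steps, the sketch below computes the 2-D SST gradient magnitude and a per-image Min-Max rescaling; the gradient threshold is hypothetical, not a value reported in the study.

```python
import numpy as np

def sst_gradient_magnitude(sst):
    """Magnitude of the 2-D SST gradient, used to rank scenes by
    frontal signal strength (uniform grid spacing assumed)."""
    gy, gx = np.gradient(sst)
    return np.sqrt(gx**2 + gy**2)

def min_max_normalize(img, eps=1e-8):
    """Per-image Min-Max normalization, mapping pixel values to [0, 1]."""
    lo, hi = np.nanmin(img), np.nanmax(img)
    return (img - lo) / (hi - lo + eps)

# Synthetic SST field (degC); the 0.05 threshold is illustrative only
sst = np.random.rand(64, 64) * 10 + 15
grad = sst_gradient_magnitude(sst)
keep = grad.mean() > 0.05
norm = min_max_normalize(sst)
```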
Subsequently, to enhance the model’s generalization capability under complex environmental conditions and varying observational scenarios, a multi-strategy data augmentation procedure was applied to the training samples. The augmentation techniques included geometric transformations such as random rotation, horizontal flipping, and scaling, which improve the model’s robustness to spatial variability.
Figure 2 presents a representative example of the augmentation process.
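The geometric augmentations can be sketched as follows. The 90-degree rotation granularity, the scaling range, and the nearest-neighbour zoom are illustrative assumptions; the key point is that identical transforms are applied to the SST image and its front mask so labels stay aligned.

```python
import numpy as np

rng = np.random.default_rng(0)

def zoom_nearest(a, scale):
    """Nearest-neighbour spatial scaling that keeps the array size fixed."""
    h, w = a.shape
    ys = np.clip((np.arange(h) / scale).astype(int), 0, h - 1)
    xs = np.clip((np.arange(w) / scale).astype(int), 0, w - 1)
    return a[np.ix_(ys, xs)]

def augment(image, mask):
    """Random rotation, horizontal flip, and scaling, applied identically
    to the image and its mask."""
    k = int(rng.integers(0, 4))                      # rotate by k * 90 degrees
    image, mask = np.rot90(image, k).copy(), np.rot90(mask, k).copy()
    if rng.random() < 0.5:                           # horizontal flip
        image, mask = image[:, ::-1].copy(), mask[:, ::-1].copy()
    s = rng.uniform(0.9, 1.1)                        # mild random scaling
    return zoom_nearest(image, s), zoom_nearest(mask, s)

img = np.random.rand(32, 32)
msk = (img > 0.5).astype(np.uint8)
aug_img, aug_msk = augment(img, msk)
```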
In the mask generation phase, considering the inherent ambiguity of ocean front boundaries and the continuous variation in spatial scales, this study employed the Labelme [
12,
13,
14] annotation tool—primarily using the polygon tool—to manually delineate ocean front contours (
Figure 3). By connecting the vertices of each polygon, labeled regions were created and stored as JSON files to generate the initial binary masks. To ensure annotation reliability, all samples were independently labeled by two trained annotators with an oceanographic background, followed by expert reconciliation of disagreements. The inter-annotator agreement, quantified using Cohen’s kappa coefficient, reached 0.82, indicating high labeling consistency.
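The conversion from polygon vertices to an initial binary mask can be sketched as below. In practice a library rasterizer (e.g., PIL.ImageDraw) would typically be used on the Labelme JSON output; a pure-NumPy even-odd ray-casting version is shown only to keep the example self-contained.

```python
import numpy as np

def polygon_to_mask(vertices, shape):
    """Rasterize one polygon (list of (x, y) vertices, Labelme-style) into
    a binary mask using even-odd ray casting at pixel centres."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    px, py = xs.ravel() + 0.5, ys.ravel() + 0.5    # pixel-centre coordinates
    inside = np.zeros(px.shape, dtype=bool)
    v = np.asarray(vertices, dtype=float)
    n = len(v)
    for i in range(n):
        x1, y1 = v[i]
        x2, y2 = v[(i + 1) % n]
        crosses = (y1 <= py) != (y2 <= py)         # edge spans this scanline
        # x-coordinate where the edge crosses the horizontal ray
        xint = x1 + (py - y1) * (x2 - x1) / (y2 - y1 + 1e-12)
        inside ^= crosses & (px < xint)
    return inside.reshape(h, w).astype(np.uint8)

mask = polygon_to_mask([(2, 2), (12, 2), (12, 12), (2, 12)], (16, 16))
```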
This comprehensive preprocessing workflow substantially improved the quality and adaptability of the training data, providing a robust foundation for the effective training and reliable evaluation of deep learning models in ocean front detection.
3. Methodology
3.1. Model Architecture
The U-Net architecture, originally proposed by Ronneberger et al. [
4], is a widely used fully convolutional network (FCN) designed for pixel-wise semantic segmentation tasks in medical and remote sensing imagery. As illustrated in
Figure 4, the network adopts a symmetric encoder–decoder structure consisting of a contracting path and an expanding path. In this study, a vanilla U-Net is used, with a four-stage encoder–decoder and skip connections for multi-scale feature fusion. The input is a single-channel SST image, and the output is a single-channel binary front mask.
In the contracting path, each stage comprises two consecutive convolution operations followed by a max-pooling downsampling operation. Mathematically, the feature transformation at layer l can be expressed as

x_l = MaxPool(σ(W_{l,2} ∗ σ(W_{l,1} ∗ x_{l−1} + b_{l,1}) + b_{l,2})),

and each expanding-path stage fuses upsampled decoder features with the corresponding encoder features through a skip connection,

y_l = σ(W_l ∗ Concat(Up(y_{l+1}), x_l) + b_l),

where x_l denotes the encoder-side feature map, σ is the activation function, ∗ denotes convolution, and Concat represents channel-wise concatenation. This architecture is particularly suitable for detecting elongated, continuous structures with vague boundaries, such as oceanic fronts.
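The contracting-path stage can be sketched in single-channel NumPy form; this is a minimal illustration of the conv–ReLU–conv–ReLU–maxpool pattern, not the trained multi-channel network.

```python
import numpy as np

def conv3x3_same(x, w, b=0.0):
    """Single-channel 3x3 convolution with zero padding, stride 1."""
    xp = np.pad(x, 1)
    h, wd = x.shape
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            out += w[i, j] * xp[i:i + h, j:j + wd]
    return out + b

def relu(x):
    return np.maximum(x, 0.0)

def maxpool2(x):
    """2x2 max pooling with stride 2 (the downsampling step)."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def encoder_stage(x, w1, b1, w2, b2):
    """One contracting-path stage: two conv+ReLU ops, then max pooling.
    The pre-pooling feature f is reused later via the skip connection."""
    f = relu(conv3x3_same(x, w1, b1))
    f = relu(conv3x3_same(f, w2, b2))
    return f, maxpool2(f)

rng = np.random.default_rng(1)
x = rng.random((16, 16))
f, down = encoder_stage(x, rng.random((3, 3)), 0.0, rng.random((3, 3)), 0.0)
```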
Figure 5 illustrates the architectural layout of the Mask R-CNN model adopted in this study. The overall design builds upon the two-stage object detection framework of Faster R-CNN and extends it with an instance segmentation branch to enable pixel-level mask prediction. Structurally, the network consists of five convolutional modules, three pooling operations, two fully connected (FC) layers, and a classifier head, forming a multi-branch deep architecture capable of multi-scale feature extraction and region-level semantic modeling.
The input image is first processed by a backbone feature extractor—typically ResNet-50 or ResNet-101—combined with a Feature Pyramid Network (FPN), which together constitute the five hierarchical convolutional stages. These layers progressively encode semantic information at multiple spatial scales. Interleaved within the convolutional stages, three max-pooling operations compress spatial dimensions while preserving key features, facilitating multi-scale representation alignment.
Following feature extraction, the Region Proposal Network (RPN) identifies candidate Regions of Interest (RoIs) and predicts their objectness scores and bounding box offsets. To improve spatial alignment, the model employs a RoIAlign operation, which replaces conventional RoIPooling and ensures sub-pixel precision in feature alignment, mitigating boundary mislocalization caused by quantization.
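The sub-pixel sampling that distinguishes RoIAlign from RoIPooling can be illustrated with a crude single-sample-per-bin sketch; real implementations average several bilinear samples per output bin and operate over feature channels.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at a continuous (y, x)
    location, avoiding the coordinate quantization of RoIPooling."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def roi_align(feat, box, out_size=2):
    """One bilinear sample at each bin centre of an RoI given as
    (y1, x1, y2, x2) in continuous feature-map coordinates."""
    y1, x1, y2, x2 = box
    bh, bw = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = bilinear_sample(feat, y1 + (i + 0.5) * bh,
                                        x1 + (j + 0.5) * bw)
    return out

feat = np.arange(36, dtype=float).reshape(6, 6)
crop = roi_align(feat, (1.0, 1.0, 4.0, 4.0), out_size=2)
```

Because the feature map here is linear in (y, x), bilinear interpolation recovers exact values at the fractional bin centres, which makes the behaviour easy to verify by hand.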
Subsequently, two parallel fully connected layers are introduced to perform classification and bounding box regression, respectively. In parallel, a lightweight fully convolutional mask branch is applied within each RoI to generate high-resolution binary masks, delineating object contours at the pixel level. The final classifier integrates outputs from all branches to produce the final instance-level predictions.
3.2. Model Training
To ensure a fair comparison of model performance, all training procedures were conducted under a unified configuration. The Adam optimizer was employed with an initial learning rate set to 1 × 10−4, a batch size of 8, and a total of 100 training epochs. All SST inputs were resized to 512 × 512 and normalized to the range [0, 1] using Min–Max scaling. The U-Net model adopted a single-channel input and generated a single-channel binary front mask. In addition, a weight decay of 1 × 10−4 was applied during optimization, and the learning rate was scheduled using StepLR with a decay factor gamma of 0.1 at a fixed step interval. An early stopping strategy was introduced to mitigate the risk of overfitting.
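The StepLR schedule and early stopping logic can be sketched as follows. The step interval and patience values below are illustrative assumptions, since the text fixes only the initial learning rate (1e-4) and the decay factor (0.1).

```python
def steplr(lr0, epoch, step_size, gamma=0.1):
    """Learning rate under StepLR: decayed by `gamma` every `step_size` epochs."""
    return lr0 * gamma ** (epoch // step_size)

class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs
    (the patience value is illustrative; the paper does not specify it)."""
    def __init__(self, patience=10):
        self.patience, self.best, self.bad = patience, float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0
        else:
            self.bad += 1
        return self.bad >= self.patience   # True -> stop training

# With lr0 = 1e-4 and an assumed step_size of 30, the rate drops at epoch 30
lrs = [steplr(1e-4, e, 30) for e in range(100)]
```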
In this study, the design of the loss functions was tailored to the characteristics of each task. For U-Net, which targets full-image semantic segmentation, the loss function was defined as a weighted combination of Binary Cross-Entropy (BCE) [15, 16] and Dice Loss,

L_total = α · L_BCE + (1 − α) · L_Dice,

enabling a balance between global segmentation accuracy and boundary precision. We set α = 0.5 to equally weight the BCE and Dice losses. The Dice Loss emphasizes overlap in boundary regions, making it well-suited to imbalanced class distributions and fine-structure extraction, which helps the model reconstruct fine-scale frontal features.
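A minimal NumPy sketch of the weighted BCE + Dice combination with α = 0.5, assuming `pred` holds per-pixel probabilities:

```python
import numpy as np

def bce_dice_loss(pred, target, alpha=0.5, eps=1e-7):
    """Weighted loss L = alpha * BCE + (1 - alpha) * Dice for binary
    segmentation; `pred` contains probabilities, `target` is 0/1."""
    p = np.clip(pred, eps, 1 - eps)                  # numerical stability
    bce = -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))
    inter = np.sum(pred * target)
    dice = 1 - (2 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)
    return alpha * bce + (1 - alpha) * dice

pred = np.array([[0.9, 0.1], [0.8, 0.2]])
target = np.array([[1.0, 0.0], [1.0, 0.0]])
loss = bce_dice_loss(pred, target)
```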
The training of Mask R-CNN involves three tasks: classification of candidate regions, bounding box regression, and pixel-level mask prediction. Each of these tasks corresponds to a distinct loss component, as previously described. This multi-task optimization strategy enhances the model’s robustness in handling complex image regions and targets with indistinct boundaries. The overall loss function is formulated as a weighted sum of these task-specific losses.
The predefined training/validation/test split described in
Section 2.2 was used for all experiments to ensure consistent evaluation.
3.3. Evaluation Metrics
To systematically evaluate the performance of U-Net and Mask R-CNN in ocean front detection over the northwestern Pacific, this study adopts a multi-dimensional evaluation framework covering segmentation accuracy, boundary consistency, detection sensitivity, and computational efficiency. These metrics provide a holistic assessment of model capabilities under varied spatial and dynamic oceanic environments.
3.3.1. Spatial Segmentation Accuracy
To quantify how well the predicted masks align with the ground truth, we use the Intersection over Union (IoU) [
6,
8,
10,
17,
18] and the Dice Similarity Coefficient (Dice). These are defined as

IoU = |P ∩ G| / |P ∪ G|,  Dice = 2|P ∩ G| / (|P| + |G|),

where P denotes the predicted mask and G the ground truth. IoU evaluates the proportion of the intersected region over the union, focusing on overall area agreement, while Dice gives more weight to correctly predicted pixels in smaller regions, enhancing sensitivity to narrow frontal structures.
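For binary masks, both metrics reduce to a few lines of NumPy:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def dice(pred, gt):
    """Dice similarity coefficient of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2 * inter / total if total else 1.0

p = np.array([[1, 1, 0], [0, 1, 0]])
g = np.array([[1, 0, 0], [0, 1, 1]])
# intersection = 2, union = 4, |P| = |G| = 3
```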
3.3.2. Boundary Adherence
To evaluate how accurately the models capture the fine-scale edges of fronts, we apply the Hausdorff Distance (HD) and the Contour Matching Score (CMS). The Hausdorff Distance is defined as

HD(P, G) = max{ max_{p∈P} min_{g∈G} d(p, g),  max_{g∈G} min_{p∈P} d(p, g) },

where d(p, g) represents the Euclidean distance between points on the predicted and ground truth boundaries. This metric reflects the worst-case boundary deviation. CMS, in turn, quantifies contour-level alignment with multi-scale boundary tolerance. We first extract one-pixel-wide contours from the predicted and ground-truth binary masks, denoted C_P and C_G. To account for slight boundary ambiguity, morphological dilation is applied to the contour maps with structuring elements of increasing size; for each dilation scale, a contour overlap score is computed as the IoU between the dilated contours. The final CMS is defined as the average IoU across all dilation scales.
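A pure-NumPy sketch of both boundary metrics follows; the dilation radii in `cms` are illustrative, as the study's exact structuring-element sizes are not reproduced here, and a library routine (e.g., from scipy.ndimage) would normally handle the dilation.

```python
import numpy as np

def hausdorff(P, G):
    """Symmetric Hausdorff distance between two boundary point sets,
    each an (N, 2) array of (y, x) coordinates."""
    d = np.linalg.norm(P[:, None, :] - G[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def shift(mask, dy, dx):
    """Shift a binary map by (dy, dx), padding with zeros."""
    out = np.zeros_like(mask)
    h, w = mask.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        mask[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def dilate(mask, r):
    """Binary dilation with a (2r+1)-square structuring element."""
    out = np.zeros_like(mask)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out |= shift(mask, dy, dx)
    return out

def cms(contour_pred, contour_gt, radii=(1, 2, 3)):
    """Contour Matching Score: mean IoU of the dilated contour maps
    across the given (illustrative) dilation scales."""
    scores = []
    for r in radii:
        a, b = dilate(contour_pred, r), dilate(contour_gt, r)
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        scores.append(inter / union if union else 1.0)
    return float(np.mean(scores))
```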
3.3.3. Detection Sensitivity
Detection performance in complex backgrounds is evaluated using the F1-score, defined as the harmonic mean of precision and recall:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  F1 = 2 · Precision · Recall / (Precision + Recall),

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
F1-score is especially useful in identifying weak or fragmented frontal structures in high-noise SST scenes.
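For binary front masks, the three quantities reduce to simple pixel counts:

```python
import numpy as np

def precision_recall_f1(pred, gt):
    """Pixel-wise precision, recall, and F1 for binary masks."""
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p = np.array([1, 1, 0, 0])
g = np.array([1, 0, 1, 0])
# tp = 1, fp = 1, fn = 1
```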
3.3.4. Computational Efficiency
To assess the models’ feasibility for real-time or large-scale deployment, we report inference latency (average processing time per image) and GPU memory consumption during single-image prediction (batch size = 1). All experiments are conducted on an NVIDIA RTX A5000 (24 GB), ensuring consistency across comparisons. In addition, model size and computational complexity are reported to support deployment assessment. Model parameters are defined as the total number of trainable weights, and floating-point operations (FLOPs) are estimated based on the network architecture under the 512 × 512 single-channel SST input configuration used in this study. Under this setting, the U-Net model involves approximately 23 million parameters and about 36 GFLOPs per forward pass, whereas Mask R-CNN with a ResNet-50-FPN backbone involves approximately 48 million parameters and about 220 GFLOPs. These indicators complement latency and GPU memory statistics by providing a quantitative measure of computational demand and potential deployment cost during inference.
5. Conclusions and Future Perspectives
5.1. Conclusions
This study established a unified evaluation framework to systematically compare the performance of two representative deep learning models, U-Net and Mask R-CNN, in the task of ocean front detection in the Northwest Pacific. By incorporating multidimensional metrics and a high-quality benchmark dataset, we comprehensively analyzed their capabilities in spatial segmentation, boundary alignment, fine-scale target recognition, and computational efficiency. The complementary characteristics of the two models may motivate future exploration of hybrid detection strategies, although such approaches are not implemented or validated in the present study.
The results demonstrate that U-Net offers advantages in spatial consistency and computational economy, making it well-suited for large-scale, high-efficiency remote sensing workflows. In contrast, Mask R-CNN exhibited superior performance in boundary modeling and robustness in complex backgrounds, showing higher sensitivity to fine-scale frontal structures and better adaptability to multi-front, low-contrast environments.
Despite these promising findings, the study is subject to several limitations. The number of training samples remains constrained, and multi-source data fusion has not yet been explored. Future work may incorporate physical constraints, attention mechanisms, or semi-supervised strategies to further improve model generalization and physical interpretability. In particular, hybrid or multi-stage segmentation frameworks could (i) integrate physically meaningful variables (e.g., density gradients or frontal intensity indicators) derived from observational or reanalysis datasets as auxiliary input channels, (ii) introduce physics-informed regularization terms into the loss function to enhance dynamical consistency, and (iii) apply a staged refinement strategy in regions with weak gradients or complex coastal structures. Such a structured integration of data-driven learning and physical constraints may help address boundary ambiguity and multi-scale frontal variability, while remaining a future methodological extension rather than part of the present contribution.
5.2. Future Perspectives
Despite the promising outcomes demonstrated in this study, several challenges remain that warrant further exploration to advance deep learning-based ocean front detection.
First, oceanic fronts are characterized by diffuse boundaries and highly variable morphologies. Existing models still face difficulties in precisely localizing frontal boundaries, particularly in weak-gradient regions or areas with overlapping frontal structures. Future work should consider incorporating multi-scale feature fusion, boundary-aware modules, or structural priors to enhance the model’s sensitivity to fine-scale edge details.
Second, the present study relies solely on sea surface temperature (SST) data. However, ocean fronts are often associated with concurrent variations in salinity, current velocity, and sea level anomaly (SLA). Integrating multi-modal remote sensing data, such as sea surface salinity (SSS) [
14], geostrophic currents, and SLA, could improve model robustness in thermally indistinct regions and enhance physical consistency.
Lastly, to address the scarcity of labeled data, transfer learning, domain adaptation, and semi-supervised strategies offer promising avenues for extending model applicability across different oceanic regions and seasons. Such approaches can significantly improve generalization and adaptability under diverse environmental conditions.
In summary, future efforts should focus on optimizing model structures, integrating multi-source data, enabling real-time deployment, and enhancing cross-domain generalization. These directions will further solidify the role of deep learning in advancing oceanographic research and supporting sustainable marine resource management. Beyond methodological advances, improved front detection capability also has important oceanographic implications. Reliable delineation of sea surface temperature fronts facilitates the investigation of frontal variability associated with the Kuroshio system and monsoon-driven circulation, including seasonal migration, front intensification, and interaction with mesoscale eddies. These processes are closely linked to upper-ocean material transport, biological productivity, and regional biogeochemical cycling. Therefore, enhanced front mapping may provide valuable observational constraints for ocean dynamic studies and ecosystem-related analyses.