Article

Multi-Task Learning for Ocean-Front Detection and Evolutionary Trend Recognition

1 College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
2 Donghai Standard Metrology Center, East China Sea Bureau, Ministry of Natural Resources, Shanghai 200137, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(23), 3862; https://doi.org/10.3390/rs17233862
Submission received: 11 October 2025 / Revised: 25 November 2025 / Accepted: 26 November 2025 / Published: 28 November 2025

Highlights

What are the main findings?
  • We construct the Zhejiang–Fujian Coastal Front Mask (ZFCFM) and Evolutionary Trend (ZFCFET) datasets for daily ocean-front detection (OFD) and point-level ocean-front evolutionary trend recognition (OFETR) from filled and normalized SST gradients with trend labels derived from hand-annotated fronts.
  • A multi-task 3D U-Net that shares 3D spatiotemporal features between OFD and OFETR matches strong 2D baselines for OFD with fewer parameters and similar computation, markedly improving OFETR over single-task and traditional methods.
What is the implication of the main finding?
  • The proposed multi-task framework provides a practical alternative to cascade designs, offering more robust trend recognition under imperfect segmentation and reducing reliance on binary OFD results or auxiliary input channels.
  • The datasets, model, and analysis framework introduced in this study provide a reproducible benchmark and design guidelines for future work on spatiotemporal OFD and OFETR in other regions and at different spatial and temporal resolutions.

Abstract

Ocean fronts are central to upper-ocean dynamics and ecosystem processes, yet recognizing their evolutionary trends from satellite data remains challenging. We present a 3D U-Net-based multi-task framework that jointly performs ocean-front detection (OFD) and ocean-front evolutionary trend recognition (OFETR) from sea surface temperature gradient heatmaps. Instead of cascading OFD and OFETR in separate stages that pass OFD outputs downstream and can amplify upstream errors, the proposed model shares 3D spatiotemporal features and is trained end-to-end. We construct the Zhejiang–Fujian Coastal Front Mask (ZFCFM) and Evolutionary Trend (ZFCFET) datasets from ESA SST CCI L4 products for 2002–2021 and use them to evaluate the framework against 2D CNN baselines and traditional methods. Multi-task learning improves OFETR compared with single-task training while keeping OFD performance comparable, and the unified design reduces parameter count and daily computational cost. The model outputs daily point-level trend labels aligned with the dataset’s temporal resolution, indicating that end-to-end multi-task learning can mitigate error propagation and provide temporally resolved estimates.

1. Introduction

An ocean front is characterized by a strong horizontal gradient in one direction accompanied by a weaker gradient in the perpendicular direction [1]. Ocean fronts reflect interacting buoyancy and dynamical forcings, such as riverine discharge, surface wind forcing, and open-ocean circulation [2,3], with additional modulation by tidal mixing [4], mesoscale dynamics such as eddies and coastal upwelling [5], and interannual climate variability associated with the El Niño–Southern Oscillation [6]. Their types are diverse and include sea surface temperature (SST) fronts, chlorophyll fronts, and salinity fronts, among others.
Ocean fronts and their impacts have received sustained research attention: the evolution of fronts modulates the curl and divergence of sea surface wind stress [7], the distribution of sound speed [8], the intensity of storm tracks [9], the composition of phytoplankton communities [10], and the location of fishing grounds [11]. Front evolution also alters lateral density contrasts, submesoscale activity, and surface roughness contrasts, thereby influencing air–sea exchanges [12,13]. In addition, the evolution of diurnal SST fronts affects the rectification of the diurnal warm layer, which in turn regulates boundary-layer stability, pressure gradients, and near-surface winds [14]. Consequently, both detecting ocean fronts and identifying their evolutionary trends are central to ocean-front research and operations.
A large body of work has addressed ocean-front detection (OFD). Approaches that use gradient operators characterize the edge information of fronts [15,16]. Beyond these edge-based methods, many effective algorithms have been developed in prior studies [17,18], with recent advances including both improvements to existing approaches and new methods. For example, Xing et al. proposed the CCAIM algorithm [19], which improves upon the Cayula and Cornillon algorithm (CCA) [17] by better identifying nearshore fronts, connecting discontinuous fronts, and reducing repeated detections. It significantly increases both the number of nearshore front pixels and the average length of offshore front segments. Wang et al. introduced a Bayesian decision and metric-space-based detection and tracking method (BFDT–MSA) [20] that integrates probabilistic fusion, morphological optimization, and spatiotemporal tracking to effectively reduce over-detection and enhance continuity and spatial consistency.
Although traditional methods offer good interpretability, many rely on threshold settings whose selection is typically experience-driven. While several threshold-free approaches have been proposed in recent years, they still face limitations in detecting small target fronts (STFs).
Deep learning–based approaches have been increasingly applied to OFD in recent years. He et al. proposed the Dense Contextual Ensemble Network (DCENet) [21] to address the difficulty of detecting multi-class STFs in remote-sensing imagery. By introducing a stacked contextual ensemble structure (S-CE), a dense aggregation (DA) block, and a hybrid loss, DCENet markedly improves STF detection. Wang et al. developed Res-U-Net [22], which incorporates residual modules into U-Net to mitigate gradient vanishing and accuracy limitations in fine-scale OFD, achieving higher accuracy and stability than the vanilla U-Net. Wan et al. proposed a Dynamic Gradient Orientation and Multi-Scale Fusion Network (DGOMFN) [23] to address strong noise, limited interpretability, and insufficient multi-scale feature fusion. DGOMFN integrates a dynamic angle constraint module (DACM) and a multi-scale gradient fusion mechanism (MSGF) and builds on a modified You Only Look Once version 11 (YOLOv11) detection framework that incorporates a cross-scale Transformer, dynamic snake convolution, and scale-aware feature fusion. It achieves high accuracy on datasets from the Kuroshio region in the northwestern Pacific and substantially improves weak-front detection capability.
OFD focuses on pixel-level identification in images, whereas ocean-front evolutionary trend recognition (OFETR) examines changes in front occurrence and intensity across multiple time scales to characterize stagewise evolutionary trends.
For long-term trends, Xing et al. processed high-resolution SST imagery from 1982 to 2021 using the CCAIM algorithm to construct a digital atlas of persistent fronts surrounding global Large Marine Ecosystems (LMEs) and to assess four-decade trends [24]. Within LMEs, both the occurrence and the intensity of persistent fronts increased significantly overall (global means rising by approximately 0.08% and 0.009 °C/100 km per decade, respectively), with increases concentrated in subtropical and polar regions associated with boundary currents and upwelling systems, whereas the tropics were relatively stable or showed slight declines. Moreover, Yang et al. combined satellite observations with statistical analyses to examine ocean-warming hotspots over 2003–2020. They found that frontal activity and chlorophyll generally declined in equatorial and subtropical-gyre hotspots, increased in most high-latitude hotspots, and showed mixed behavior in boundary-current hotspots, with some trends not statistically significant [25].
For medium- to short-term trends, Yang et al. proposed a three-branch fusion algorithm for OFETR, in which the intensity branch ingests daily-averaged time series, the scale branch derives features from MMF-based pixel-level front masks, and a third branch uses optical flow to encode spatiotemporal changes [26]. Finally, the three branches are fused by a weighted sum to determine the evolutionary trend (strengthening or weakening) within a given time window (window-level), achieving high classification accuracy on their curated dataset. The method is computationally efficient and supports variable-length input sequences.
Current research on OFETR still has several limitations.
  • Most OFETR methods adopt a cascaded paradigm that first performs OFD and then conducts OFETR based on the resulting masks. This paradigm is sensitive to upstream recognition errors, which can be amplified in the OFETR stage. Moreover, the masks produced by OFD discard the original intensity distribution, often forcing the introduction of extra channels or handcrafted rules to compensate for the information gap, which in turn impedes effective information sharing between the two tasks.
  • Most short-term OFETR methods produce results that are restricted to the window level and therefore lack explicit representations of point-level changes at a temporal resolution commensurate with the underlying data. As illustrated in Figure 1, relying solely on window-level labels can obscure the within-window, fine-grained evolution of the front, including the strengthening during the stage shown in Figure 1a–d and the weakening during the stage shown in Figure 1d–f. Moreover, because available datasets differ markedly in temporal resolution, and because many studies now analyze daily or even subdaily scales, a single window-level output is increasingly inadequate for probing the mechanisms that govern front evolution [7,8,9,10,11,12,13,14].
To address these limitations and better exploit the shared spatiotemporal structure of OFD and OFETR, we propose a multi-task learning framework based on 3D U-Net. We attach a lightweight OFETR head to a 3D U-Net backbone, enabling the model to simultaneously produce daily OFD masks and daily OFETR trends. This design reduces cascaded error propagation, maintains a modest parameter count, and keeps outputs timely at the daily scale.
The contributions of this study are as follows.
  • We introduce a 3D U-Net-based multi-task learning framework that enables OFETR to explicitly share the spatiotemporal representations learned by OFD, avoiding the information loss caused by treating the two tasks in isolation. Under conditions where OFD varies only slightly, our framework improves OFETR while significantly reducing the parameter count and suppresses cascaded error propagation by design.
  • The model is end-to-end and requires only SST gradient heatmaps as input, without any external data sources, thereby simplifying the data collection and implementation pipeline.
  • Compared with window-level outputs, the framework directly provides point-level trends, preserving the strengthening and weakening processes within a window and aligning more closely with studies of the mechanisms of front evolutionary trend.

2. Materials and Methods

This section describes the datasets and the proposed multi-task learning framework.

2.1. Dataset

We use the ESA SST CCI Level-4 Analysis product v3.0 [27], which provides global daily mean SST from 1980 to 2024 at a spatial resolution of 0.05° × 0.05°. From this product, we extract a 2002–2021 subset over 118–126°E and 24–32°N, covering the Zhejiang–Fujian Coastal Front and the Kuroshio front. From this subset, we construct two task-specific datasets targeting OFD and OFETR, respectively, with annotations focused primarily on the Zhejiang–Fujian Coastal Front. All annotations are produced by a single annotator.

2.1.1. Zhejiang–Fujian Coastal Front Mask Dataset

The Zhejiang–Fujian Coastal Front Mask (ZFCFM) dataset targets OFD. To avoid treating coastal ocean pixels as missing due to the land mask, we compute gradients with a filled Sobel operator that applies the local boundary treatment of Cao et al. [28] during gradient computation. For each valid ocean pixel, we examine its 3 × 3 neighborhood. If the central pixel has a valid SST value and at least one neighbor is invalid, we temporarily replace those invalid neighbor values with the SST of the central pixel only when applying the Sobel kernel, without modifying the underlying SST field. This boundary treatment prevents land contamination and keeps the valid SST statistics unchanged. In our study region, approximately 3.43% of valid ocean pixels have at least one invalid neighbor in their 3 × 3 neighborhood. Finally, we crop the resulting gradient-magnitude maps to the target region, giving images with a spatial size of 160 × 160.
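As an illustration, the following NumPy sketch implements this boundary treatment; the function name filled_sobel_gradient is ours, and scaling the raw Sobel response to °C/100 km from the grid spacing is omitted for brevity.

```python
import numpy as np

def filled_sobel_gradient(sst, valid):
    """Sobel gradient magnitude with the local boundary treatment of Cao et al. [28].

    sst   : 2D array of SST values with NaN over land.
    valid : boolean mask, True for valid ocean pixels.
    Invalid neighbors are temporarily replaced by the center value only while
    the Sobel kernels are applied; the underlying SST field is not modified.
    Conversion of the raw kernel response to degC/100 km is omitted here.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = sst.shape
    gmag = np.full((h, w), np.nan)
    sst_pad = np.pad(sst, 1, constant_values=np.nan)
    valid_pad = np.pad(valid, 1, constant_values=False)
    for i in range(h):
        for j in range(w):
            if not valid[i, j]:
                continue
            win = sst_pad[i:i + 3, j:j + 3].copy()
            # Fill invalid neighbors with the center value for this kernel application only.
            win[~valid_pad[i:i + 3, j:j + 3]] = sst[i, j]
            gmag[i, j] = np.hypot(np.sum(kx * win), np.sum(ky * win))
    return gmag
```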
To guide the choice of the normalization range for the gradient heatmaps, we analyze the empirical cumulative distribution function (ECDF) of gradient magnitude over all valid ocean pixels in 2002–2021 in Figure 2a. The heatmaps used in this study are computed from gradient-magnitude maps after applying a fixed min–max normalization with lower and upper bounds of 0 °C/100 km and 20 °C/100 km, respectively, before applying the jet colormap. Values above 20 °C/100 km are saturated by this conversion and account for only about 0.1% of all valid ocean pixels. If the upper bound were much larger, color differences at low gradient magnitude would become difficult to discern. If it were much smaller, too many high gradient-magnitude values would be discarded. We therefore adopt 20 °C/100 km as a compromise between these two effects. This fixed normalization range also keeps the color mapping consistent across all images so that relative differences in front intensity are visually comparable from day to day.
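A minimal sketch of this fixed min–max normalization and colormap conversion is given below, assuming matplotlib's jet colormap; the function name and NaN handling are ours.

```python
import numpy as np
import matplotlib.pyplot as plt

def gradient_to_heatmap(grad, vmin=0.0, vmax=20.0):
    """Map a gradient-magnitude field (degC/100 km) to an RGB heatmap with the jet colormap.

    Values above vmax saturate at the top of the colormap; NaN (land) is set to 0 first.
    """
    g = np.nan_to_num(grad, nan=0.0)
    g = np.clip((g - vmin) / (vmax - vmin), 0.0, 1.0)   # fixed min-max normalization to [0, 1]
    rgba = plt.get_cmap("jet")(g)                       # (H, W, 4) floats in [0, 1]
    return rgba[..., :3]                                # keep RGB channels only
```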
Pixel-level front masks are manually annotated on the resulting gradient heatmaps. After annotation, we refine the masks by removing pixels whose gradient magnitude is below 3 °C/100 km. As indicated by Figure 2a, pixels with gradient magnitude below 3 °C/100 km account for about 80.3% of all ocean pixels, so this threshold restricts the masks to the upper tail of the gradient distribution. Figure 2b shows the ECDF of gradient magnitude within the annotated masks, where pixels above 20 °C/100 km represent about 0.71% of mask pixels. When the filled Sobel operator is used in the preprocessing, the number of front pixels in the masks increases by about 5.44% compared with the masks derived from unfilled gradients. In total, masks are created for 7305 days.
Figure 3a summarizes the monthly ocean-only coverage of the Zhejiang–Fujian Coastal Front using all observations in the ZFCFM dataset for 2002–2021. Coverage is lower from June to October and higher in January–March and December, indicating that scene difficulty varies by month. The shaded band shows the (95%) confidence interval of the cross-year monthly mean and lies close to the mean line for most months, implying limited interannual variability within the same calendar month.

2.1.2. Zhejiang–Fujian Coastal Front Evolutionary Trend Dataset

The Zhejiang–Fujian Coastal Front Evolutionary Trend (ZFCFET) dataset is designed for the OFETR task. Following [24,25], we define front intensity as the mean gradient within the front area. Let the day index be d = 1, 2, …, D, let G(d) denote the gradient field of size m × n, and let C(d) ∈ {0, 1}^{m×n} denote the annotated front mask from the ZFCFM dataset on day d. The daily intensity s(d) is

$$s(d) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} C_{ij}(d)\, G_{ij}(d)}{\sum_{i=1}^{m}\sum_{j=1}^{n} C_{ij}(d)}.$$
Define the day-to-day change Δs(d) = s(d) − s(d − 1) for d = 2, …, D. We first assign an initial label by the sign of Δs(d): strengthening if Δs(d) > 0, weakening if Δs(d) < 0, and neutral if Δs(d) = 0. Figure 2c shows the ECDF of the absolute change |Δs(d)|.
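The following NumPy sketch computes s(d) and Δs(d) from the gradient fields and ZFCFM masks as defined above; the function name and the guard against empty masks are ours.

```python
import numpy as np

def daily_intensity(grads, masks):
    """Compute s(d), the mean gradient inside the ZFCFM mask, and ds(d) = s(d) - s(d-1).

    grads : array (D, m, n) of gradient magnitudes in degC/100 km (NaN over land).
    masks : binary array (D, m, n) of annotated front masks.
    """
    g = np.nan_to_num(grads, nan=0.0)
    num = (masks * g).sum(axis=(1, 2))
    den = masks.sum(axis=(1, 2))
    s = num / np.maximum(den, 1)          # guard against days with an empty mask
    ds = np.diff(s)                       # ds[k] corresponds to day d = k + 2
    return s, ds
```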
To reduce the impact of SST noise, annotation errors in ZFCFM, and short-term front variability, we apply a two-step labeling rule (Algorithm 1) with L_min = 2 and τ = 0.5 °C/100 km to obtain the final labels l(d). Step 1 confirms labels for consecutive days with a consistent nonzero sign over at least L_min days. Step 2 revisits the remaining single-day segments. If |Δs(d)| ≥ τ, the day keeps its own sign. Otherwise, if it is located between two confirmed segments that have the same sign, the day inherits this common label. Intuitively, this rule only treats a trend as stable when at least two consecutive days change in the same direction, and it smooths isolated small-magnitude fluctuations by aligning them with neighboring segments that show a persistent sign. Days that remain unlabeled after Algorithm 1 are left for manual inspection.
We assess the sensitivity of this procedure in Figure 4. The figure reports the fraction of days that still require manual confirmation after Algorithm 1 for different choices of L_min and τ. Increasing L_min from two to three or four days causes a marked rise in the proportion of days that require manual labeling. In contrast, the choice of the absolute threshold τ has only a minor influence on the manual fraction. Across the range of tested τ values, the manual fraction varies by about 1.08 to 1.37 percentage points (pp). Under our default setting with L_min = 2 and τ = 0.5 °C/100 km, 7.76% of days require manual confirmation. These default settings reflect a compromise between trend stability and annotation effort. With L_min = 2, the rule already requires two consecutive days with the same sign, which reduces the number of ambiguous single-day changes that would otherwise have to be checked manually. The threshold τ = 0.5 °C/100 km was selected as the candidate value closest to the 95th percentile of |Δs(d)| in Figure 2c, so only the strongest day-to-day changes are treated as potentially meaningful and smaller fluctuations are regarded as noise. In practice, Algorithm 1 automatically labels the large majority of days, while the remaining days are inspected and, if necessary, corrected by a human annotator according to the visual criteria, so a small degree of subjectivity remains in the final labels.
Algorithm 1 Two-step rule for daily trend labels
Require: Day-to-day changes Δs(d) for d = 2, …, D; minimum run length L_min; threshold τ
Ensure: Labels l(d) ∈ {−1, 0, 1} for d = 2, …, D
1:  l(d) ← 0 and σ(d) ← sign(Δs(d)) for d = 2, …, D
2:  R ← ∅
    Step 1: consecutive days
3:  d ← 2
4:  while d ≤ D do
5:      s ← σ(d), t ← d
6:      while t ≤ D and σ(t) = s do
7:          t ← t + 1
8:      L ← t − d
9:      append (d, L, s) to R
10:     if s ≠ 0 and L ≥ L_min then
11:         for u ← d to t − 1 do
12:             l(u) ← s
13:     d ← t
    Step 2: single-day cases
14: let R = {(d_r, L_r, s_r)}, r = 1, …, |R|
15: for r ← 1 to |R| do
16:     if L_r ≠ 1 then
17:         continue
18:     d_0 ← d_r
19:     if |Δs(d_0)| ≥ τ then
20:         l(d_0) ← s_r
21:     else if 1 < r < |R| then
22:         (·, L_{r−1}, s_{r−1}) ← R[r−1]
23:         (·, L_{r+1}, s_{r+1}) ← R[r+1]
24:         if s_{r−1} ≠ 0 and s_{r+1} ≠ 0 and s_{r−1} = s_{r+1} and l(d_0 − 1) ≠ 0 and l(d_0 + 1) ≠ 0 then
25:             l(d_0) ← l(d_0 − 1)
26: return l(d) for d = 2, …, D
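For reference, a minimal Python sketch of this two-step rule is given below; it encodes strengthening as +1, weakening as −1, and unresolved days as 0, and the function and variable names are ours.

```python
import numpy as np

def two_step_labels(ds, l_min=2, tau=0.5):
    """Two-step daily trend labeling (Algorithm 1), as a sketch.

    ds : array of day-to-day changes, ds[i] = s(d) - s(d-1) for d = 2..D.
    Returns labels in {-1, 0, +1}; 0 marks days left for manual inspection.
    """
    sigma = np.sign(ds).astype(int)
    labels = np.zeros_like(sigma)
    runs = []                                   # (start, length, sign) of equal-sign runs
    # Step 1: confirm runs of at least l_min consecutive days with the same nonzero sign.
    i = 0
    while i < len(sigma):
        j = i
        while j < len(sigma) and sigma[j] == sigma[i]:
            j += 1
        runs.append((i, j - i, sigma[i]))
        if sigma[i] != 0 and (j - i) >= l_min:
            labels[i:j] = sigma[i]
        i = j
    # Step 2: revisit single-day runs.
    for r, (start, length, sign) in enumerate(runs):
        if length != 1:
            continue
        if abs(ds[start]) >= tau:
            labels[start] = sign                # a strong change keeps its own sign
        elif 0 < r < len(runs) - 1:
            prev_sign, next_sign = runs[r - 1][2], runs[r + 1][2]
            if (prev_sign != 0 and prev_sign == next_sign
                    and labels[start - 1] != 0 and labels[start + 1] != 0):
                labels[start] = labels[start - 1]   # inherit the common neighboring label
    return labels
```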
If we rerun Algorithm 1 with the same thresholds but using gradients computed without the filled Sobel operator, the resulting labels differ on 253 days. Among days that are not flagged for manual review, 125 labels flip between strengthening and weakening. These numbers suggest that the filling step affects only a small fraction of the time series. However, it can still change the trend classification on some days, most likely where gradients are weak or close to nearshore missing values.
Days that remain unlabeled after Algorithm 1 are reviewed using daily gradient heatmaps. Figure 5 illustrates one typical case from 27 October to 2 November 2004, where the sign of Δ s ( d ) alternates and Algorithm 1 automatically labels 28 and 29 October as weakening and 1 and 2 November as strengthening, while 30 and 31 October remain unresolved. When such an unresolved segment is bracketed by two already confirmed segments, we search for turning points inside the gap. Candidate turning days are identified using local extrema of Δ s ( d ) , for example, 29 October and 31 October in Figure 5. For each candidate, we compare the evolution before and after the candidate day along the following qualitative criteria: (1) whether the frontal shape and area exhibit a clear change, (2) whether the core region of high gradient magnitude shows systematic intensification or weakening, (3) whether changes along the frontal edges correspond to widespread low-gradient fading or to localized high-gradient sharpening, and (4) for weak fronts with few pixels, whether the number of frontal pixels supports the hypothesized trend. These visual indicators are interpreted in a way that is consistent with the candidate trend. In a strengthening segment we expect the high-gradient area to intensify, the frontal edges to become sharper, and the number of pixels to increase for small fronts, while the opposite holds for weakening segments.
In the example in Figure 5, the evolution from panel (c) to panel (g) shows little change in overall shape, with enhanced core gradients and clearer edges, which is consistent with a strengthening trend. In contrast, when we assume panel (e) as the turning day and examine the evolution from panel (a) to panel (e), the change in core gradients does not match a clear weakening pattern and the edges only become slightly blurred. Under these criteria, panel (c) receives stronger support as the turning point than panel (e), so panels (d) and (e) are labeled as strengthening. If the comparison between candidates still remains ambiguous, we prefer the configuration with the fewest turning points and select candidate days corresponding to the local maxima or minima of Δ s ( d ) , depending on the type of turning point. For longer, unresolved segments, we apply the same logic and may introduce multiple turning points if required. When the segments before and after the unresolved interval share the same trend, we distinguish between smoothly evolving cases and abrupt changes and judge them according to the same criteria.
After manual labeling, we perform a second consistency check on days that were automatically labeled by Algorithm 1. We apply the same qualitative criteria as above when reassessing these automatically labeled days. We particularly inspect sequences of at least four consecutive days where the difference between the maximum and minimum of s ( d ) is smaller than 0.1 °C/100 km, days adjacent to turning points where | Δ s ( d ) | is smaller than 0.2 °C/100 km, and days that were filled in by Step 2 with | Δ s ( d ) | < τ . In total, 2558 days are revisited in this second pass, and the labels of 116 days are corrected.
The finalized classification set contains 7304 labeled days, among which 3151 are strengthening and 4153 are weakening.
Figure 3b shows that in the ZFCFET dataset, most months tend toward weakening. The solid line denotes the cross-year mean of the year–month strengthening proportion. Only in July, November, and December does the strengthening proportion exceed the weakening proportion. The shaded band presents the 95% confidence interval across years, and its typical width is about 10 percentage points. This indicates modest interannual differences within the same calendar month while the overall monthly pattern remains similar.
Independently of our point-level labeling method, we characterize long-term trends in the daily Zhejiang–Fujian Coastal Front intensity. Overall, the series shows a weak, nonsignificant weakening during 2002–2011 and a significant strengthening during 2012–2021. For 2002–2011, as shown in Figure 6a, the decadal slope is slightly negative, −0.081 °C/100 km per decade, with a relative change of −1.42% per decade and a Mann–Kendall p-value of 0.067. During 2012–2021, as shown in Figure 6b, the slope becomes positive and statistically significant, +0.194 °C/100 km per decade, with a relative change of +3.45% per decade and a Mann–Kendall p-value of 3.05 × 10⁻⁴. To quantify these trends, we adopt a nonparametric framework comparable to Xing et al. [24]. We aggregate the daily intensity to monthly means and smooth the series with a centered 12-month moving average to attenuate intra-annual variability. A Pettitt change-point test on the smoothed monthly series indicates a statistically significant break near February 2012 (K = 6242, p = 7.65 × 10⁻⁹), which motivates the use of segment-wise monotonic trends estimated by the Theil–Sen slope with Mann–Kendall significance rather than a single trend over 2002–2021.
We split the data temporally into disjoint sets: training (2002–2013), validation (2014–2017), and test (2018–2021). To allow the model to leverage longer temporal context for each prediction, we construct time-series clips of length 2t days with an overlap of t + 1 days, and we set t = 3 in this study, which yields 6-day clips with 4-day overlap. For ZFCFET, the first day of each clip is unlabeled because trend targets describe day-to-day changes. Consequently, each 2t-day clip contains 2t ZFCFM masks and 2t − 1 ZFCFET labels aligned to days 2, 3, …, 2t. With t = 3, this yields 6 inputs and 5 labels per clip. Clip construction is performed within each split only. No clip crosses a split boundary and no samples are shared across splits. This results in 2190 clips for training and 729 clips for both validation and test.
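The clip construction within a split can be sketched as follows, assuming zero-based day indices within the split; the helper name make_clips is ours.

```python
def make_clips(num_days, t=3):
    """Build overlapping clip index ranges within one temporal split.

    Clip length is 2t days and consecutive clips overlap by t + 1 days,
    i.e. the start index advances by t - 1 days between clips.
    Returns a list of (start, end) pairs with end exclusive.
    """
    length, stride = 2 * t, t - 1
    clips, start = [], 0
    while start + length <= num_days:
        clips.append((start, start + length))
        start += stride
    return clips

# Example: 6-day clips with 4-day overlap (t = 3) over a 12-day split.
print(make_clips(12, t=3))   # [(0, 6), (2, 8), (4, 10), (6, 12)]
```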

2.2. Method

During dataset construction, we observed that OFETR and OFD are highly correlated in their spatiotemporal representations. Cascaded paradigms propagate and magnify upstream errors through the feature pipeline. They also yield only window-level labels, which fail to capture within-window, point-level evolution.
To address these limitations, we cast OFD and OFETR as heterogeneous multi-task learning on a shared 3D representation that jointly outputs daily OFD masks and daily OFETR trends [29]. This formulation explicitly shares spatiotemporal features and mitigates information fragmentation. It also removes the hard dependence on upstream masks and reduces error propagation by design. In addition, the method uses only SST gradient heatmaps as input, which keeps the data pipeline simple.

2.2.1. Network Architecture

Given the widespread adoption of 3D-CNNs for temporal image-sequence modeling and their well-documented ability to explicitly encode temporal features [30], we employ a 3D-CNN backbone for both OFD and OFETR. Following the U-Net family, we choose 3D U-Net [31] as the shared backbone. Although originally proposed for 3D medical image segmentation, 3D U-Net and related 3D-CNNs have been successfully adapted to sequence-based tasks [32,33].
Figure 7 provides an overview of our multi-task network. The input is a clip of length T (we use T = 6 ), and the encoder channel widths are 64, 128, 256, 512. To preserve temporal information, we neither downsample nor upsample along the time axis, and the temporal stride is set to 1 throughout. Spatial downsampling is performed by max pooling with a 1 × 2 × 2 kernel and stride 1 × 2 × 2 , and spatial upsampling is implemented by a transposed 3D convolution with the same spatial kernel size and stride. After the final decoder, a 1 × 1 × 1 convolution and a Sigmoid produce daily OFD masks of shape ( T , H , W ) .
Training 3D-CNNs carries an overfitting risk on small datasets because of their larger parameter counts. To improve efficiency and generalization, we replace all 3D convolutions in the encoder and decoder with the Pseudo-3D (P3D) blocks of Qiu et al. [34]. Since our clips overlap temporally, vanilla 3D convolutions introduce redundant computation, and the P3D factorization significantly reduces both the parameter count and floating-point operations (FLOPs). Unless otherwise specified, P3D denotes the P3D-A variant throughout this paper. Each convolution in P3D is followed by GroupNorm (32 groups) and ReLU.
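As an illustration, a minimal PyTorch sketch of a P3D-A-style block with this normalization and activation scheme is shown below; the class name and exact layer ordering are ours and may differ in detail from the original implementation of Qiu et al. [34].

```python
import torch
import torch.nn as nn

class P3DBlock(nn.Module):
    """P3D-A-style factorized 3D convolution: spatial (1x3x3) then temporal (3x1x1)."""

    def __init__(self, in_ch, out_ch, groups=32):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.norm1 = nn.GroupNorm(min(groups, out_ch), out_ch)
        self.norm2 = nn.GroupNorm(min(groups, out_ch), out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                      # x: (B, C, T, H, W)
        x = self.act(self.norm1(self.spatial(x)))
        x = self.act(self.norm2(self.temporal(x)))
        return x
```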
On top of the shared layers, the OFETR head first applies a temporal convolution (kernel size 3) to extract short-range dynamics while halving the channel dimension. In Figure 7, the head is shown attached at the first decoder (d1). Other attachment points are discussed in Section 3. A subsequent P3D layer strengthens spatiotemporal interactions. We then perform global average pooling over the spatial dimensions to obtain a daily temporal feature sequence of length T. To match the label definition in ZFCFET, a final temporal convolution with a 2 × 1 × 1 kernel and stride 1 outputs T − 1 point-level trend labels, corresponding to days 2 through T of the input clip.
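A minimal PyTorch sketch of this OFETR head is shown below, reusing the P3DBlock from the previous sketch; for simplicity the final 2 × 1 × 1 temporal convolution is written as a 1D convolution applied after spatial pooling, and all names are ours.

```python
import torch.nn as nn

class OFETRHead(nn.Module):
    """Lightweight trend-recognition head attached to a shared 3D U-Net stage (sketch)."""

    def __init__(self, in_ch, num_classes=2):
        super().__init__()
        mid = in_ch // 2
        # Temporal convolution (kernel 3 along time) that halves the channel dimension.
        self.temporal = nn.Sequential(
            nn.Conv3d(in_ch, mid, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.GroupNorm(min(32, mid), mid),
            nn.ReLU(inplace=True),
        )
        self.p3d = P3DBlock(mid, mid)          # strengthens spatiotemporal interactions
        # Kernel 2 along time with no padding: T inputs -> T - 1 trend predictions.
        self.classifier = nn.Conv1d(mid, num_classes, kernel_size=2)

    def forward(self, x):                      # x: (B, C, T, H, W)
        x = self.p3d(self.temporal(x))
        x = x.mean(dim=(3, 4))                 # global average pooling over space -> (B, C, T)
        return self.classifier(x)              # (B, num_classes, T - 1), for days 2..T
```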
Because 3D U-Net already maps gradient heatmaps to semantic masks, embedding a lightweight OFETR head into the same backbone enables multi-task learning with minimal architectural changes, avoiding a separate end-to-end model for trend recognition. This substantially reduces the overall parameter count and computational cost. Under hard parameter sharing, OFD and OFETR share low- and mid-level representations and exploit their latent relatedness, while OFETR does not take OFD outputs as upstream inputs, thereby eliminating the explicit error-cascade pathway.

2.2.2. Compared Model

Among all compared models, two methods require additional clarification because their original formulations are not directly compatible with the datasets constructed in this study.
For CCAIM, we use the global daily front dataset of Xing et al. [35], which is also built from the ESA SST CCI Level-4 analysis [27]. We extract the CCAIM frontal lines over our study region and period and use them as a traditional baseline. Since ZFCFM provides frontal zones rather than lines, we convert the CCAIM lines into frontal zones using the dilation scheme of Xing et al. [35]. Concretely, we first reconstruct SST on the analysis grid using inverse distance weighting (IDW), then compute gradient magnitudes with the improved Sobel operator of Xing et al. [19], and apply the same logarithmic transformation as in Xing et al. [35]. We then partition the domain into Voronoi cells defined by the detected front lines. Within each cell, the line is expanded by geodesic dilation so that only pixels connected to the original frontal line are added, and the dilation stops when the gradient magnitude falls below r times the nearest line gradient, where r is the chosen gradient-ratio threshold. This procedure yields nonoverlapping frontal zones that are directly comparable to the ZFCFM masks.
For each day, CCAIM typically detects several frontal lines with associated frontal zones {Z_k}. Let G denote the corresponding ZFCFM mask. For any index set S of CCAIM lines, we define

$$I(S) = \frac{\left|\left(\bigcup_{k \in S} Z_k\right) \cap G\right|}{\left|\left(\bigcup_{k \in S} Z_k\right) \cup G\right|}.$$

Starting from S_0 = ∅, at each step we add the index k that yields the largest increase in I(S) and update S_{t+1} = S_t ∪ {k}, continuing until the IoU no longer improves. The final union of selected zones approximates the hand-labeled frontal zone for that day. We use this greedy selection only to construct comparable reference masks for CCAIM, rather than as a deployable front-selection procedure.
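A minimal NumPy sketch of this greedy zone selection is given below; the function name and tie-breaking behavior are ours.

```python
import numpy as np

def greedy_zone_selection(zones, ref_mask):
    """Greedily select CCAIM frontal zones whose union maximizes IoU with the ZFCFM mask.

    zones    : list of binary arrays, one per detected frontal zone Z_k.
    ref_mask : binary reference mask G for the same day.
    Returns the indices of the selected zones and the final IoU.
    """
    def iou(pred, ref):
        inter = np.logical_and(pred, ref).sum()
        union = np.logical_or(pred, ref).sum()
        return inter / union if union > 0 else 0.0

    selected, current, best = [], np.zeros_like(ref_mask, dtype=bool), 0.0
    while True:
        gains = [(iou(np.logical_or(current, z), ref_mask), k)
                 for k, z in enumerate(zones) if k not in selected]
        if not gains:
            break
        score, k = max(gains)
        if score <= best:                      # stop when IoU no longer improves
            break
        best, current = score, np.logical_or(current, zones[k])
        selected.append(k)
    return selected, best
```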
For the ETR algorithm [26], we adapt its three-branch architecture to the ZFCFET dataset. The strength branch operates on an OFD result: for each day, we take the front mask predicted by our single-task 3D U-Net and compute the daily mean gradient magnitude within this mask, consistent with our definition of front intensity. The scale branch uses, for each day, the number of front pixels inside the predicted mask as a measure of front area. The optical-flow branch takes as input optical-flow fields computed from the SST gradient heatmaps, where the gradients are mapped to RGB using the jet colormap before flow estimation. For the strength and scale branches, we first apply the same temporal interpolation scheme as in Yang et al. [26] and then perform uniform average pooling along time to obtain fixed-length 40-dimensional vectors, instead of using the nonuniform temporal sampling strategy in the original implementation, so as to preserve the temporal evolution as evenly as possible. For the optical-flow branch, we use a GoogLeNet Inception-v3 backbone pretrained on ImageNet [36], modify its first convolutional layer to accept two input channels, and feed it the two-channel optical-flow fields. Finally, we adjust the classification heads so that the outputs of all three branches are combined into day-level trend labels that match the ZFCFET format and can be evaluated with the same metrics as our OFETR models.

2.2.3. Loss

For OFD, we adopt the sum of a positive-class weighted binary cross-entropy (BCE) loss and a Dice loss as the pixel-level objective.
Let a batch contain m pixels after unfolding over space and time. For pixel i, denote the OFD logit by z_i^OFD, the probability by p_i^OFD = σ(z_i^OFD) = 1 / (1 + e^{−z_i^OFD}), and the binary label by y_i^OFD ∈ {0, 1}. Let m_+ = Σ_{i=1}^{m} y_i^OFD and m_- = m − m_+ be the numbers of positive and negative pixels in the batch. We define the positive-class weight as

$$w_{\mathrm{BCE}} = \frac{m_-}{m_+ + \varepsilon_{\mathrm{BCE}}}, \qquad \varepsilon_{\mathrm{BCE}} = 10^{-6}.$$
The weighted per-pixel BCE loss is

$$l_i^{\mathrm{BCE}} = -\left[ w_{\mathrm{BCE}}\, y_i^{\mathrm{OFD}} \log p_i^{\mathrm{OFD}} + \left(1 - y_i^{\mathrm{OFD}}\right) \log\!\left(1 - p_i^{\mathrm{OFD}}\right) \right],$$

and the batch-averaged BCE is

$$L_{\mathrm{BCE}} = \frac{1}{m} \sum_{i=1}^{m} l_i^{\mathrm{BCE}}.$$

We define the Dice loss as

$$L_{\mathrm{Dice}} = 1 - \frac{2 \sum_{i=1}^{m} p_i^{\mathrm{OFD}} y_i^{\mathrm{OFD}} + \varepsilon_{\mathrm{Dice}}}{\sum_{i=1}^{m} p_i^{\mathrm{OFD}} + \sum_{i=1}^{m} y_i^{\mathrm{OFD}} + \varepsilon_{\mathrm{Dice}}}, \qquad \varepsilon_{\mathrm{Dice}} = 10^{-6}.$$
Here, L_BCE measures the discrepancy between labels and predictions at the pixel level, while L_Dice emphasizes the overlap between predicted and ground-truth regions at the set level. The OFD loss is

$$L_{\mathrm{OFD}} = L_{\mathrm{BCE}} + L_{\mathrm{Dice}}.$$
ZFCFET exhibits mild class imbalance, so we adopt a normalized class-weighted cross-entropy. Let the dataset contain N samples and K classes with labels Y_j^OFETR ∈ {0, 1, …, K − 1}. In our case, K = 2 (strengthening and weakening). Let N_k be the number of samples in class k. The class weights are

$$w_k^{\mathrm{CE}} = \frac{N}{K\, N_k}, \qquad k = 0, 1, \ldots, K - 1.$$

In this work, the class weights w_k^CE are computed from the training split class counts and then fixed. The same training-derived weights are used when computing losses for the training, validation, and test splits.
For a batch of n samples, with logits z_{j,k}^OFETR and softmax probabilities

$$p_{j,k}^{\mathrm{OFETR}} = \frac{\exp\!\left(z_{j,k}^{\mathrm{OFETR}}\right)}{\sum_{k'=0}^{K-1} \exp\!\left(z_{j,k'}^{\mathrm{OFETR}}\right)},$$

and one-hot targets y_{j,k}^OFETR ∈ {0, 1}, the per-sample loss is

$$l_j^{\mathrm{CE}} = -\sum_{k=0}^{K-1} w_k^{\mathrm{CE}}\, y_{j,k}^{\mathrm{OFETR}} \log p_{j,k}^{\mathrm{OFETR}},$$

and the batch-averaged OFETR loss is

$$L_{\mathrm{OFETR}} = \frac{1}{n} \sum_{j=1}^{n} l_j^{\mathrm{CE}}.$$
Let L_OFD(t) and L_OFETR(t) be the OFD and OFETR losses at training iteration t, with task weights w_OFD and w_OFETR. In this work, we use equal weights and define the total loss as

$$L(t) = w_{\mathrm{OFD}}\, L_{\mathrm{OFD}}(t) + w_{\mathrm{OFETR}}\, L_{\mathrm{OFETR}}(t) = L_{\mathrm{OFD}}(t) + L_{\mathrm{OFETR}}(t).$$
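For illustration, the two objectives can be sketched in PyTorch as follows; the function names are ours, and we rely on the built-in pos_weight and weight arguments rather than an explicit per-pixel or per-sample implementation.

```python
import torch
import torch.nn.functional as F

def ofd_loss(logits, targets, eps=1e-6):
    """OFD objective: positive-class weighted BCE plus Dice loss (sketch).

    logits, targets : tensors of shape (B, T, H, W); targets are 0/1 front masks (float).
    """
    probs = torch.sigmoid(logits)
    m_pos = targets.sum()
    m_neg = targets.numel() - m_pos
    w_bce = m_neg / (m_pos + eps)              # positive-class weight m_- / (m_+ + eps)
    bce = F.binary_cross_entropy_with_logits(logits, targets, pos_weight=w_bce)
    dice = 1 - (2 * (probs * targets).sum() + eps) / (probs.sum() + targets.sum() + eps)
    return bce + dice

def ofetr_loss(logits, labels, class_counts):
    """OFETR objective: class-weighted cross-entropy with weights N / (K * N_k) (sketch).

    logits : (N, K) trend logits; labels : (N,) integer class indices.
    class_counts : (K,) class counts taken from the training split.
    """
    counts = torch.as_tensor(class_counts, dtype=logits.dtype, device=logits.device)
    weights = counts.sum() / (len(counts) * counts)
    return F.cross_entropy(logits, labels, weight=weights)

# Total multi-task loss with equal task weights, as in the text:
# loss = ofd_loss(ofd_logits, masks) + ofetr_loss(ofetr_logits, trend_labels, train_counts)
```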

2.3. Metrics

Because our temporal clips overlap, a given calendar day can receive multiple predictions from different clips. For each task, we first align all predictions to calendar days and, whenever multiple clips contribute to the same day, we average the predicted probabilities over clips. For OFD, we then threshold the averaged probabilities at the pixel level to obtain a single merged binary mask for each day. For OFETR, we apply a Softmax over trend classes and take the argmax of the clip-averaged probabilities as the daily label.
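A minimal sketch of this day-level merging for OFD is given below; the function name and data layout are assumptions.

```python
import numpy as np

def merge_daily_ofd(clip_probs, clip_days, threshold=0.5):
    """Merge overlapping clip predictions into one OFD mask per calendar day (sketch).

    clip_probs : list of arrays (T, H, W) with per-pixel front probabilities, one per clip.
    clip_days  : list of length-T sequences of calendar-day indices, one per clip.
    """
    per_day = {}
    for probs, days in zip(clip_probs, clip_days):
        for p, d in zip(probs, days):
            per_day.setdefault(d, []).append(p)
    # Average probabilities over all clips contributing to the same day, then threshold.
    return {d: np.mean(ps, axis=0) >= threshold for d, ps in per_day.items()}
```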
For OFD, we report mean Intersection over Union (mIoU) and mean F1 (mF1) computed over daily merged masks. Let TP, FP, and FN denote the numbers of true positives, false positives, and false negatives computed on the merged daily mask,

$$\mathrm{IoU} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},$$

$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{F1} = \frac{2\, \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
We compute IoU and F1 for each evaluation day and then average them over all days to obtain mIoU and mF1. IoU reflects the agreement between predicted and reference front regions, and F1 partly compensates for class imbalance by balancing Precision and Recall across days.
For OFETR, we report Accuracy and F1, with F1 defined as above. Let TP, FP, FN, and TN denote the numbers of true positives, false positives, false negatives, and true negatives aggregated over all evaluation days. The accuracy is

$$\mathrm{Acc} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}.$$
Accuracy and F1 are the main metrics that we report for OFETR and that we use for model selection in the main experiments. In addition to these threshold-based metrics, we compute the area under the receiver operating characteristic curve (AUROC) for each configuration. AUROC provides a threshold-independent view of model behavior across all decision thresholds. It is less sensitive to moderate class imbalance and to the exact choice of threshold. We present Accuracy and F1 in the main results tables. AUROC appears in the text and figures as a complementary metric for analyzing training dynamics and comparing the relative robustness of different OFETR configurations.

3. Results

All experiments are conducted on a single NVIDIA RTX 4090 GPU. Unless otherwise noted, deep learning models are trained with the Adam optimizer, a learning rate of 1 × 10⁻⁴, a StepLR scheduler with a step size of 10 and decay factor γ = 0.8, a batch size of 10, and 100 training epochs. In the ETR baseline, the classification heads use a learning rate of 1 × 10⁻³, while the optical-flow branch and all other deep learning models use a learning rate of 1 × 10⁻⁴ with the same scheduler settings.
For OFD, we compare widely used 2D CNN baselines and models designed specifically for front detection. The 2D baselines include U-Net [37], U-Net++ [38], DCENet [21], and Res-U-Net [22]. As a traditional baseline for OFD, we further include CCAIM [19] based on the global front dataset of Xing et al. [35]. For OFETR, we compare our 3D U-Net-based models with the ETR algorithm of Yang et al. [26] adapted to the ZFCFET dataset.
Within the 3D U-Net family, we evaluate four settings: single-task OFD, single-task OFETR, multi-task learning where OFD and OFETR are optimized jointly, and a cascade configuration that mimics error propagation in practical detection. We denote these configurations by 3D U-Net-S (single-task), 3D U-Net-M (multi-task), and 3D U-Net-C (cascade), respectively. When an OFETR head is used, the attachment point is indicated in parentheses (e.g., 3D U-Net-S(e4)), and models without an attachment point (e.g., 3D U-Net-S) are trained only for OFD. For the single-task and multi-task configurations, we attach the OFETR head at different encoder or decoder stages and study how the attachment position affects the trend classification performance. In the cascade setting, we first train a single-task 3D U-Net for OFD and then use its predicted OFD masks as inputs for a separate single-task OFETR model. Specifically, the predicted mask is concatenated with the three-channel SST gradient heatmap to form a four-channel input for OFETR. During training, validation, and testing of the cascade OFETR model, we do not use the ground-truth masks and keep all other settings identical to the corresponding single-task OFETR configuration.
For CCAIM, we convert frontal lines into frontal zones as described in Section 2.2.2. The gradient threshold r used to expand lines into frontal zones is selected on the ZFCFM validation set by grid search over {0, 0.05, …, 1.0} and is then kept fixed when evaluating on the test set. For the ETR algorithm, we combine the outputs of the three branches by a weighted sum. The weight of each branch is selected on the validation set by grid search over a small discrete set of integer coefficients in {−2, −1, 0, 1, 2} and is then kept fixed on the test set.
Model selection on the validation set is based on

$$J = \frac{\mathrm{mIoU}_{\mathrm{OFD}} + \mathrm{Acc}_{\mathrm{OFETR}}}{2}.$$

If a model does not produce OFD outputs, we set mIoU_OFD = 0 when computing J. If a model does not produce OFETR outputs, we set Acc_OFETR = 0. The checkpoint that achieves the highest J on the validation set is evaluated on the test set. All deep learning experiments are run with the same five random seeds, and results for these models are reported as mean ± standard deviation (%) over the five runs.

3.1. Comparison Across Models

Where appropriate, we assess pairwise differences using two-sided paired t-tests on per-run scores and control the family-wise error rate at α = 0.05 using Holm–Bonferroni correction within each group of related models.
For CCAIM, we set the gradient-ratio threshold to r = 0.5 , which is the best value obtained from a grid search on the ZFCFM validation set and coincides with the value used in the global daily front dataset of Xing et al. [35]. The resulting performance is reported in Table 1. Even with this tuned setting, the IoU between the CCAIM-based frontal zones and the ZFCFM masks remains relatively modest. When applied to ZFCFM, CCAIM is subject to two practical limitations. First, some small frontal zones in the nearshore region are not captured by the detected front lines and therefore cannot be recovered by the subsequent dilation. Second, the ZFCFM masks are constructed by retaining only pixels whose SST gradient magnitude exceeds 3 °C/100 km, whereas the ratio-based thresholding in CCAIM may assign pixels with smaller gradients to the frontal zones. This mismatch between the absolute-gradient filtering in ZFCFM and the ratio-based expansion rule in CCAIM may lead to additional over- and under-coverage at the front boundaries, which is consistent with the moderate IoU values observed.
Among the 2D CNNs, U-Net, U-Net++, DCENet, and Res-U-Net all reach high segmentation accuracy. The 3D U-Net-S attains OFD performance that is close to the strongest 2D baselines while using considerably fewer parameters than U-Net and U-Net++. Compared with the lighter Res-U-Net, 3D U-Net-S remains competitive and slightly improves both mIoU and mF1, at the cost of higher FLOPs due to the temporal dimension. Within the OFD baseline family (DCENet, 3D U-Net-S, and the multi-task 3D U-Net-M(e4)), paired t-tests on per-run mIoU show that the difference between DCENet and 3D U-Net-S is not statistically significant after Holm–Bonferroni correction (p ≈ 0.053), whereas 3D U-Net-M(e4) exhibits a small but significant decrease of about 0.95 percentage points relative to DCENet (p ≈ 0.0013).
For single-task OFETR, 3D U-Net-S with the head attached at intermediate or decoder stages achieves the highest accuracy and F1. Attachments near the middle encoder or early decoder stages generally offer a good balance between performance and efficiency. Very shallow encoder attachments lose some discriminative power. Very deep decoder attachments increase computational cost and do not consistently improve classification. These patterns suggest that the front evolutionary trend benefits from features that are neither too low level nor too close to the final OFD logits.
We also report results for the ETR algorithm adapted to the ZFCFET dataset. In the modified ETR model, we search over different random seeds when training the three-branch classifier. Among five runs, the selected branch weights for the strength, scale, and optical-flow paths are (−1, 2, 1) in two runs, (0, 1, 1) in one run, (2, 2, 1) in one run, and (−2, 1, 2) in one run. Compared with the window-level dataset used by Yang et al. [26], these configurations indicate that the strength and optical-flow branches contribute more strongly in our daily, pixel-level setting. At the same time, the relatively large variation in the strength weight suggests that the linear classifier in the strength branch may not reliably capture day-to-day trend signals from the front-masked mean SST gradients.
In the multi-task setting, OFD and OFETR respond differently to the attachment position. For OFD, shallow attachments preserve segmentation accuracy best. Multi-task 3D U-Net with early encoder attachments maintains mIoU and mF1 close to the single-task 3D U-Net, whereas deeper attachments lead to noticeable degradation in OFD metrics. For OFETR, most attachment points gain accuracy and F1 under multi-task learning compared with the corresponding single-task OFETR models. However, the deepest decoder attachments exhibit reduced trend-classification performance, which indicates competition between tasks at these stages. Considering both tasks together, mid encoder or early decoder attachments in the multi-task models provide a favorable trade-off, giving consistent OFETR improvements while limiting the loss in OFD. Within the 3D U-Net-M family, the OFETR accuracies of the e3, e4, and d1 attachments differ by at most about 0.6 percentage points, and paired t-tests with Holm–Bonferroni correction do not detect significant differences among these three configurations (all p > 0.08 ), so we regard them as broadly comparable in OFETR performance.
The cascade 3D U-Net-C variants further improve OFETR performance. In this setting, the OFD model is trained first, and its predicted masks are used as an additional input channel for a separate OFETR model. The cascade attachments reach the highest OFETR accuracy and F1 among all deep learning models in Table 1, particularly for encoder attachments. At the shared e4 attachment point, 3D U-Net-M(e4) improves OFETR accuracy over 3D U-Net-S(e4) by about 1.8 percentage points, and the cascade 3D U-Net-C(e4) further improves accuracy by about 1.6 percentage points relative to 3D U-Net-M(e4) and by about 3.4 percentage points relative to 3D U-Net-S(e4). All of these gains are statistically significant under paired t-tests with Holm–Bonferroni correction within this family (p ≤ 0.013). This pattern indicates that feeding realistic OFD predictions into OFETR can guide the trend classifier toward front-related regions, even when the OFD masks are imperfect.
Although the cascade configuration achieves the highest OFETR Accuracy and F1 when driven by high-quality OFD masks, the margin over the best single-task and multi-task configurations (3D U-Net-S(d2) and 3D U-Net-M(e4)) is about 2.7 and 2.4 percentage points in accuracy, respectively, and both improvements are statistically significant under paired t-tests with Holm–Bonferroni correction (p ≤ 0.0041), whereas the accuracy gap between the best single-task and multi-task models is small (approximately 0.3 points) and not significant (p ≈ 0.55). This comes at the cost of a more fragile pipeline. The OFETR head in the cascade setting relies almost entirely on the binary segmentation results and thus inherits any systematic mis-detections in OFD. In contrast, the multi-task 3D U-Net jointly uses the gradient fields and shared 3D features and does not condition its trend prediction on a hard mask. As a result, multi-task OFETR attains slightly lower peak scores than the cascade upper bound but offers a simpler, single-stage design that is less sensitive to segmentation errors.
Figure 8 tracks validation dynamics for OFD and OFETR and contrasts single-task 3D U-Net baselines with multi-task variants where the OFETR head is attached at specific stages. We compare strong single-task reference models with the single-task 3D U-Net and the corresponding multi-task models whose OFETR heads are attached at e3, e4, or d1 (d2 is the single-task reference for OFETR). OFD is evaluated by mIoU, whereas OFETR is evaluated by AUROC.
For OFD in Figure 8a–c, the DCENet reference remains consistently above the 3D U-Net-S curve for most epochs, and the 3D U-Net-S in turn remains above the multi-task counterparts that share backbone weights. This ordering is also reflected in their convergence behavior. DCENet reaches a plateau earlier in training, followed by the 3D U-Net-S, while the multi-task variants converge more slowly, especially when the OFETR head is attached deeper in the network. At the same time, DCENet exhibits noticeable oscillations in mIoU during roughly the first ten epochs. As training progresses, these oscillations diminish and all OFD models converge to stable performance levels.
For OFETR in Figure 8d, which compares 3D U-Net-S with the head attached at d2 and e3 as well as the multi-task model attached at e3, the d2 baseline peaks between roughly ten and twenty epochs and then gradually decreases. Both e3 curves peak later, between about twenty and thirty epochs. The multi-task e3 curve remains above its single-task counterpart for most epochs and exhibits a smoother trajectory, which indicates that multi-task training can stabilize the trend-classification dynamics at this attachment position. In Figure 8e, the strongest peaks for 3D U-Net-S with the head attached at d2 and for the multi-task model attached at e4 both occur around ten to twenty epochs, whereas the single-task e4 curve peaks slightly later. The multi-task e4 curve reaches a higher peak than d2 and exhibits only a modest decrease afterward, while the single-task e4 curve shows a more pronounced decline after its maximum. In Figure 8f, which compares d2 with d1 single-task and d1 multi-task configurations, all curves reach their maxima near ten to twenty epochs and then decrease. The multi-task d1 curve maintains higher AUROC than the single-task d1 curve for most epochs, again indicating a benefit from joint optimization with OFD at this attachment position.
Across Figure 8d–f, moving the OFETR head deeper in the network tends to shift the epoch of maximum AUROC earlier in training. The peak appears earlier for d2 than for e4 and earlier for e4 than for e3. Within the multi-task setting, the peak AUROC at d1 exceeds that at e4 and the peak at e4 exceeds that at e3. This pattern indicates a depth-dependent gain in early OFETR performance while preserving the stability advantage of multi-task training relative to the corresponding single-task models. Taken together, these observations suggest that the potential of multi-task learning for OFETR is not yet fully exploited.
We further visualize the OFD results in Figure 9. Panels (a)–(f) show the reference SST gradient heatmaps for 4 to 9 April 2018. In this period, several distinct frontal zones appear within the target region. Panels (g)–(l) give the corresponding DCENet predictions, and panels (m)–(r) show the OFD 3D U-Net-S predictions for the same dates.
From (g) to (j), we observe that DCENet sometimes locks onto a non-target front in the multi-front zone. In particular, (h) and (i) display large spurious frontal patterns along the top boundary. These correspond to neighboring fronts that should not be included in the target mask. At the same dates, the 3D U-Net-S predictions in (n) and (o) do not show such broad false detections and remain closer to the annotated frontal zone.
Comparing the later days, the DCENet outputs in (j)–(l) with the 3D U-Net-S outputs in (p)–(r), it is apparent that 3D U-Net-S tends to use information from neighboring days to keep tracking the same target front within the crowded frontal region. This is consistent with the behavior on 5 and 6 April, where (n) and (o) do not exhibit the large misdetections seen in (h) and (i). At the same time, the smoother temporal evolution in 3D U-Net-S can lead to small local misassignments between adjacent fronts on some days, which appear as narrow false frontal fragments around the main target band.
To better understand how the OFETR head exploits intermediate representations, we visualized the features at the last encoder stage (e4) immediately before the OFETR head for both the single-task and multi-task models. Specifically, Figure 10 shows the L2 norm of the upsampled e4 feature vectors at each pixel for the 3D U-Net-S(e4) and 3D U-Net-M(e4) configurations over the six-day period from 1 to 6 January 2018. Each panel displays the feature-norm map as a 2D heatmap, and the white contour indicates the reference frontal mask from the ZFCFM dataset. Brighter colors denote larger feature norms, i.e., stronger activations. Columns are aligned by date across rows, so that each column directly compares the single-task (top row) and multi-task (bottom row) representations for the same day.
From panels (a)–(f), corresponding to the single-task model, we observe a gradual decrease in activation intensity within the frontal zone over the six-day sequence. A similar temporal trend is visible in the multi-task feature maps in panels (g)–(l). In both rows, small localized bright spots appear outside the frontal contour, indicating residual responses to background gradients. In addition, the spatial distribution of activations differs systematically between the two settings. The single-task model shows stronger responses toward the lower-left segment of the front, whereas the multi-task variant distributes its activations more uniformly along the entire frontal zone. This contrast is particularly evident when comparing panels (d)–(f) with panels (j)–(l), where the multi-task model more consistently highlights the full frontal structure rather than emphasizing a single subregion. This more uniform allocation of attention along the frontal zone is also more consistent with the way the ZFCFET dataset is constructed, in which trend labels are defined over the entire frontal zone rather than a specific subsection.
We also compare the monthly performance of different models, as shown in Figure 11. The heatmaps reveal similar month-by-month patterns across architectures, with OFD reported as mIoU and mF1 and OFETR reported as Accuracy and F1. For OFD, mIoU in (a) and mF1 in (b) consistently dip from March to May and again in September and reach their highest levels in December. This trend is shared by all OFD models. Under multi-task learning, the 3D U-Net-M rows lie slightly below the corresponding single-task 3D U-Net rows in most months, indicating a systematic loss in OFD when the backbone is shared with OFETR.
For OFETR, Accuracy and F1 in (c) and (d) peak in March and are lowest in October. The multi-task model with the OFETR head attached at e4 improves over its single-task counterpart in most months. The gains are modest but positive in almost every month except January, July, and October, where performance is similar or slightly lower. The cascade baseline, which uses OFD masks as input, attains the highest OFETR Accuracy and F1 in the majority of months and therefore represents the most optimistic configuration when good OFD masks are available.
Taken together, these results highlight clear month-dependent difficulty in both tasks. For OFD on ZFCFM, March–May and September are consistently harder, with lower mIoU and mF1 across models. For OFETR on ZFCFET, February, September, and especially October show reduced Accuracy and F1 for all architectures. At the same time, even in months where OFD is weaker, such as March–May, OFETR under multi-task and cascade settings remains higher than in the corresponding single-task OFETR baselines. This suggests that the OFD branch learns features that are useful for trend recognition and can provide a net benefit to OFETR, even when segmentation performance itself is not at its strongest.

3.2. Effect of Clip Length on Performance

In Section 2.1.2, we defined the clip length and overlap via two integers T = 2t and O = t + 1. In the clip-length ablation, we vary t ∈ {2, 3, 4}, which corresponds to T ∈ {4, 6, 8} and O ∈ {3, 4, 5}, respectively. Under this schedule, to obtain contiguous point-level predictions, producing one day of output requires processing T + O days of input. For example, for t = 3, one needs ten days.
We report computation as Daily FLOPs (G/day) by converting per-forward FLOPs to an effective daily cost implied by the sliding window. Consecutive clips overlap by O = t + 1 days, so each forward contributes t new day-level predictions in the interior. Hence, the steady-state cost per day is
$$\mathrm{Daily\ FLOPs} = \frac{\mathrm{FLOPs}_{\mathrm{fwd}}}{t}.$$
Formally, letting $N_{\mathrm{fwd}}(D; t)$ be the number of forward passes used to produce D day-level outputs under $(T, O) = (2t, t + 1)$, we compute
$$\mathrm{Daily\ FLOPs} = \mathrm{FLOPs}_{\mathrm{fwd}} \cdot \lim_{D \to \infty} \frac{N_{\mathrm{fwd}}(D; t)}{D} = \frac{\mathrm{FLOPs}_{\mathrm{fwd}}}{t},$$
which matches the intuitive expression up to negligible boundary effects. For 2D CNN baselines that operate per day without temporal overlap, each forward yields exactly one day-level output, so $\mathrm{Daily\ FLOPs} = \mathrm{FLOPs}_{\mathrm{fwd}}$.
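To make this accounting concrete, the following minimal Python sketch converts a per-forward cost into the amortized daily cost under the sliding-window schedule. The per-forward values in the example are illustrative assumptions chosen so that the resulting daily figures roughly reproduce Table 2; they are not profiler outputs.

```python
def daily_flops(flops_fwd_g: float, t: int) -> float:
    """Amortized daily cost implied by the sliding-window schedule.

    Each forward pass processes a clip of length T = 2*t and, in the interior
    of a long sequence, contributes t new day-level predictions, so the
    steady-state cost per day is FLOPs_fwd / t; boundary effects vanish as
    the number of predicted days grows.
    """
    return flops_fwd_g / t


# Illustrative per-forward costs (G), assumed here so that the daily figures
# roughly match Table 2; they are not measured values.
for t, flops_fwd in [(2, 159.64), (3, 179.61), (4, 212.84)]:
    print(f"t={t} (T={2 * t}, O={t + 1}): {daily_flops(flops_fwd, t):.2f} G/day")
```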
Table 2 reports OFD and OFETR performance together with Daily FLOPs for single-task and multi-task settings across the tested clip lengths.
For OFD, the single-task 3D U-Net exhibits a slight decrease in performance as T increases. With T = 4 , mIoU and mF1 reach 92.80% and 96.13%. With T = 6 , they decrease to 92.55% and 95.99%, corresponding to drops of 0.25 and 0.14 percentage points relative to T = 4 . With T = 8 , they are 92.48% and 95.96%, for overall decreases of 0.32 and 0.17 percentage points. In the multi-task setting, OFD at attachment points e3 and e4 also performs best with shorter clips, with the highest mIoU and mF1 at T = 4 . At d1, the best segmentation appears at T = 6 , while T = 4 and T = 8 remain close.
For OFETR, single-task models at e3 and e4 achieve their highest accuracy and F1 at T = 4 . At e3, accuracy and F1 decrease from 84.52%/81.37% at T = 4 to 84.09%/80.36% at T = 6 and to 83.06%/79.30% at T = 8 . At e4, the curves show a similar decreasing pattern. At d1, T = 6 provides the strongest single-task performance with 85.15% accuracy and 82.16% F1. In the multi-task setting, the dependence on clip length is weak at e3, with a small trade-off between accuracy and F1 across T. At e4, T = 6 yields the best scores. At d1, T = 8 attains the highest multi-task performance with 85.45% accuracy and 83.07% F1, although all three lengths remain within about 0.3 percentage points of each other.
For computational cost, the 3D U-Net-S Daily FLOPs decrease from 79.82 G/day at T = 4 to 59.87 G/day at T = 6 , a reduction of about 25%. They further drop to 53.21 G/day at T = 8 , which is an additional reduction of about 11% relative to T = 6 . Multi-task variants follow the same pattern because their per-forward FLOPs scale in the same way with t. Notably, the single-task 3D U-Net at T = 6 uses 59.87 G/day, about 11% higher than U-Net++ (53.97 G/day), while delivering comparable segmentation accuracy. At the same time, it reduces the computational cost by about 23% compared with DCENet, whose Daily FLOPs are 77.69 G/day.
Figure 12 compares multi-task models across clip lengths T in terms of mIoU for OFD and AUROC for OFETR at different attachment points.
For OFD in Figure 12a–c, all three attachment points show a similar ordering: the T = 4 curve stays above T = 6 for most epochs, and both remain above T = 8 . The shorter clips therefore provide slightly higher segmentation accuracy throughout training, while longer clips preserve the same qualitative behavior of the curves at a lower computational cost.
For OFETR, the temporal response is more sensitive to the attachment point. At e3 in Figure 12d, all three curves reach their maxima around epochs 20–30 and then decline to a plateau. In the late-epoch plateau, AUROC follows the order T = 4 > T = 6 > T = 8 , and the confidence bands for T = 4 and T = 6 are narrower than for T = 8 . At e4 in Figure 12e, the peaks appear earlier, between epochs 10 and 20. Before the peak, T = 4 is highest, followed by T = 6 and T = 8 . After the peak, the ranking reverses and T = 6 maintains the highest AUROC, T = 8 is second, and T = 4 is lowest in the later epochs. At d1 in Figure 12f, the overall shape and peak timing are similar to e4, with all clip lengths peaking around epochs 10–20 and then decreasing, but the confidence band for T = 6 is wider than those for T = 4 and T = 8 , indicating greater variability across seeds at this attachment point.
Taken together with the results in Table 2, these dynamics suggest that T = 6 offers a robust compromise between accuracy and computational cost across both tasks and attachment points, and we therefore adopt T = 6 in all subsequent experiments unless otherwise noted.

3.3. Effect of the P3D Block on Model Performance

To compare the impact of different 3D blocks, we evaluate the original 3D convolution (C3D) used in prior work and the P3D block under both single-task and multi-task settings. Table 3 summarizes the OFD and OFETR metrics together with the parameter count and daily FLOPs for these configurations.
For single-task OFD, replacing C3D with P3D causes only a small accuracy loss but yields large computational savings: mIoU decreases from 92.88% to 92.55% (−0.33 pp) and mF1 from 96.18% to 95.99% (−0.19 pp), while the parameter count and daily FLOPs drop from 17.70 M and 129.42 G/day to 9.08 M and 59.87 G/day, corresponding to reductions of about 48.7% in parameters and 53.7% in computation.
For single-task OFETR, P3D performs slightly worse than C3D at all attachment points. At e3, accuracy and F1 decrease by about 0.91 and 1.44 pp. At e4, the decreases are about 0.30 and 0.28 pp. At d1, accuracy and F1 drop by about 0.46 and 0.60 pp.
In the multi-task setting, the effect of replacing C3D with P3D on OFD is small at all attachment points. At e3, P3D improves mIoU and mF1 by +0.12 and +0.05 pp. At e4, the changes are negligible. At d1, mIoU and mF1 decrease by −0.15 and −0.08 pp. For OFETR, P3D yields consistent gains in the multi-task case. At e3, accuracy and F1 increase by +0.20 and +0.02 pp. At e4, the improvements reach +0.88 and +1.06 pp. At d1, the gains are +0.23 and +0.17 pp.
In terms of efficiency for multi-task models, switching from C3D to P3D roughly halves the daily FLOPs, from about 132.2 G/day to 62.63 G/day, a reduction of about 52.6%. The number of parameters decreases by about 43–47% across the three attachment points. Overall, P3D provides a favorable trade-off: it preserves OFD accuracy to within a few tenths of a percentage point, yields clear gains for multi-task OFETR at deeper attachments, and substantially lowers the computational cost.
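For readers unfamiliar with the two building blocks, the sketch below contrasts a full 3D convolution with one common serial pseudo-3D factorization (a spatial 1×3×3 convolution followed by a temporal 3×1×1 convolution), in the spirit of [34]. The channel sizes, normalization, and activation are illustrative assumptions and do not reproduce the exact block configuration of our network.

```python
import torch
import torch.nn as nn

class C3DBlock(nn.Module):
    """Full 3D convolution: a single 3x3x3 kernel over (time, height, width)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, T, H, W)
        return self.act(self.bn(self.conv(x)))

class P3DBlock(nn.Module):
    """Factorized pseudo-3D convolution: spatial 1x3x3 followed by temporal 3x1x1."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.bn = nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.temporal(self.spatial(x))))

# A 3x3x3 kernel costs 27*in_ch*out_ch weights, whereas the factorized pair
# costs 9*in_ch*out_ch + 3*out_ch*out_ch, which is the source of the parameter
# and FLOPs savings reported in Table 3.
x = torch.randn(1, 16, 6, 64, 64)  # (batch, channels, T = 6 days, H, W)
print(C3DBlock(16, 32)(x).shape, P3DBlock(16, 32)(x).shape)
```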
Figure 13 compares P3D and C3D under identical training settings, with (a)–(c) reporting mIoU for OFD and (d)–(f) reporting AUROC for OFETR. For OFD, the two blocks show very similar learning dynamics at all attachment positions. At e3, e4, and d1, the C3D curve is slightly higher than P3D in the early epochs, while P3D tends to catch up and marginally exceed C3D later in training, and the gap remains small over the whole trajectory.
For OFETR, the relative ordering is more mixed. At e3 in (d), the two curves almost overlap for most epochs, and the peaks are very close. At e4 and d1 in (e) and (f), C3D is generally slightly higher than P3D over a broad range of epochs, although the margin is modest and the peak epochs remain similar. This behavior is consistent with the larger parameter count of C3D and suggests that the extra capacity mainly brings small gains in OFETR accuracy rather than qualitative changes in training dynamics.
Taken together with the quantitative results in Table 3, these curves indicate that P3D and C3D behave similarly during optimization. P3D trades a small amount of OFETR performance for large reductions in parameters and daily FLOPs while keeping OFD accuracy very close to that of C3D.

3.4. Effect of Input Representation

To examine how the input representation influences both tasks, we compare three encodings of the same gradient-magnitude field: a single-channel grayscale input and three-channel RGB inputs obtained by mapping the field through the jet and viridis color maps. Color choice is known to affect how structures appear in scientific images. For example, rainbow-like maps such as jet have nonuniform lightness and hue, which can create artificial boundaries and visually overemphasize certain ranges of values [39]. In contrast, perceptually designed maps such as viridis aim to provide a monotonic change in lightness so that equal data increments correspond to more uniform perceived contrast. All other settings, including architecture, clip length, and training schedule, are kept fixed. The resulting performance is summarized in Table 4.
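As a concrete illustration of how the three encodings can be produced, the sketch below maps a min–max-normalized gradient-magnitude field to a single-channel or color-mapped input using matplotlib colormaps; the normalization bounds and array shapes are placeholders rather than the exact constants used to build the datasets.

```python
import numpy as np
import matplotlib.pyplot as plt

def encode_gradient(grad: np.ndarray, mode: str = "gray",
                    vmin: float = 0.0, vmax: float = 1.0) -> np.ndarray:
    """Encode a gradient-magnitude field as a network input.

    grad : 2D array of SST gradient magnitude.
    mode : "gray" for a single channel, "jet" or "viridis" for 3-channel RGB.
    vmin, vmax : fixed normalization bounds (placeholders; the datasets apply a
                 fixed min-max normalization decided before any color mapping).
    """
    norm = np.clip((grad - vmin) / (vmax - vmin), 0.0, 1.0)
    if mode == "gray":
        return norm[None, ...]                      # (1, H, W)
    rgba = plt.get_cmap(mode)(norm)                 # (H, W, 4)
    return np.transpose(rgba[..., :3], (2, 0, 1))   # (3, H, W)

# Example: the same field yields a 1-channel or 3-channel tensor.
field = np.random.rand(128, 128)
print(encode_gradient(field, "gray").shape, encode_gradient(field, "viridis").shape)
```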
For OFD, the three encodings behave very similarly across all attachment points. In the single-task 3D U-Net, the differences between grayscale and either color map are at most about 0.2–0.3 pp in mIoU and mF1. In the multi-task 3D U-Net, the largest gap between any pair of input types remains within roughly 1 pp. The relative ordering between grayscale and jet or viridis also varies with the attachment position. These results indicate that, for dense front segmentation on ZFCFM, the network can extract effective spatiotemporal features directly from the scalar gradient magnitude, and adding a three-channel color encoding does not produce a systematic gain.
For OFETR, the impact of color mapping is more noticeable. At all attachment points, both jet and viridis improve Accuracy and F1 over grayscale by about 1–3 pp, in both single-task and multi-task settings. In the single-task models, viridis tends to achieve slightly higher F1 than jet at a given attachment (typically within about 0.5–1 pp), while in the multi-task models, jet is often marginally stronger, especially at e4 and d1. Overall, the gap between grayscale and colored inputs is larger than the gap between the two color maps themselves.
In summary, within the two datasets constructed in this study, applying a fixed color map to the gradient-magnitude field is a viable alternative to using the raw grayscale input. Different color maps do lead to small but consistent differences in OFETR performance, whereas their impact on OFD is very limited. However, these differences between jet and viridis are generally smaller than the improvement obtained by switching from grayscale to any color representation on the OFETR task.

3.5. Sensitivity of Cascade Models to OFD Mask Errors

In Table 1, the cascade 3D U-Net models achieve the highest OFETR accuracy and F1 among all architectures, especially when the OFETR head is attached at e2. However, as discussed earlier, cascade designs may suffer from error propagation from the OFD branch. To quantify this effect, we conduct a controlled perturbation study in which we modify the OFD masks before feeding them to the already trained cascade models.
Concretely, we start from the OFD masks predicted by the single-task 3D U-Net and apply 2D morphological dilation or erosion with square kernels of size k × k to every daily mask. A fixed land mask is applied so that the perturbed fronts never expand into invalid land pixels. The perturbed OFD masks are then supplied to the already trained cascade models, and we re-evaluate OFETR on the test set without retraining. Dilation is used to emulate systematic over-detection, where low-gradient regions and nontarget fronts are gradually included. Erosion is used to emulate systematic under-detection, where parts of the true frontal zone are removed. Across five random seeds, dilating with k = 3 increases the number of positive pixels by about 50% relative to the original prediction, k = 5 by about 90%, k = 7 by about 120%, and k = 9 by about 145%. Eroding with kernels k = 2 , 3 , 4 , 5 removes roughly 26%, 48%, 64%, and 75% of positive pixels, respectively, relative to the original OFD masks.
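A minimal sketch of this perturbation step is given below, assuming scipy is available; the kernel sizes and the land-mask constraint follow the description above, while the array names and shapes are illustrative.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def perturb_mask(pred_mask: np.ndarray, land_mask: np.ndarray,
                 k: int, mode: str) -> np.ndarray:
    """Apply a k x k square dilation or erosion to one daily OFD mask.

    pred_mask : boolean array, True on predicted frontal pixels.
    land_mask : boolean array, True on land pixels (always kept negative).
    mode      : "dilate" to emulate over-detection, "erode" for under-detection.
    """
    structure = np.ones((k, k), dtype=bool)
    if mode == "dilate":
        out = binary_dilation(pred_mask, structure=structure)
    elif mode == "erode":
        out = binary_erosion(pred_mask, structure=structure)
    else:
        raise ValueError(mode)
    return out & ~land_mask  # fronts never expand onto invalid land pixels

# Example: the perturbed mask is then fed to the trained cascade model.
mask = np.zeros((64, 64), dtype=bool); mask[30:34, 10:50] = True
land = np.zeros_like(mask); land[:, :5] = True
dilated = perturb_mask(mask, land, k=5, mode="dilate")
print(mask.sum(), dilated.sum())
```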
Figure 14a summarizes how these dilations affect OFETR accuracy. Relative to the baseline that uses unmodified OFD masks, moving to k = 3 and k = 5 produces the largest decreases, whereas the additional loss between k = 5 and k = 7 and between k = 7 and k = 9 is smaller. Averaged over all attachment points, the mean accuracy typically decreases by about 2–4 pp for k = 3 and k = 5 and by roughly 4–6 pp for k = 7 and k = 9 . Deeper attachments are more tolerant to dilation. For example, at d2, the accuracy loss remains within about 1–2 pp across all tested kernels, whereas shallow encoder attachments such as e1 and e2 may lose around 3–6 pp over the same kernel range. Figure 14b shows that F1 follows the same pattern as accuracy, with substantial degradation already at k = 3 and k = 5 and larger kernels further reducing performance but with smaller incremental changes.
For erosion, Figure 14c shows that small kernels k = 2 and k = 3 cause only modest accuracy reductions compared with the baseline using the original OFD masks. On average, the mean accuracy drops by about 1–3 pp in this range. In contrast, stronger erosion leads to severe degradation. At k = 4 , the mean accuracy decreases by roughly 7–9 pp, and at k = 5 , the decrease reaches about 13–15 pp across attachment points. Figure 14d shows a similar pattern for F1, with small kernels causing only limited declines and larger kernels producing large losses. No attachment point remains robust under the strongest erosions: once k ≥ 4, all curves fall far below the baseline, indicating that severe under-detection in OFD strongly limits the ability of OFETR to recover the true trend even with temporal context.
Overall, in the cascade setting, both false positives and false negatives in the OFD masks have a clear impact on OFETR performance. In our experiments, dilations with k = 7 or k = 9 and erosions with k = 4 or k = 5 correspond to severe distortions of the OFD masks, in which the number of frontal pixels is roughly doubled or reduced to about one quarter relative to the original predictions and the IoU with the original mask drops from about 0.93 to roughly 0.39 (dilation) or 0.25 (erosion). Such extreme cases are unlikely to arise in typical OFD deployments, so they should be interpreted as lower bounds on OFETR performance under strong error propagation from the OFD stage.

4. Discussion

This study aims to mitigate two limitations of existing pipelines for OFETR. First, we replace cascaded pipelines that first perform OFD and then apply OFETR to the OFD outputs with an end-to-end multi-task 3D U-Net in which OFD and OFETR share 3D spatiotemporal representations. Second, we predict daily point-level trends rather than window-level summaries. By sharing features between tasks, the model becomes less sensitive to upstream detection errors and enables both tasks to exploit common spatiotemporal information in a mutually consistent manner.
To support systematic evaluation, we constructed the ZFCFM dataset for OFD and the ZFCFET dataset for OFETR. Both datasets focus on the Zhejiang–Fujian Coastal Front, which appears on most days and typically maintains relatively high intensity. We compute gradient-magnitude maps using a filled Sobel operator applied to the SST fields, apply a fixed min–max normalization of gradient magnitude, and impose a lower bound of 3 °C/100 km when defining masks. These choices concentrate positive labels in the high-gradient tail and produce a clear target band, and therefore the overall difficulty of the current datasets is moderate. On this benchmark, we compare several 2D CNN OFD baselines, the CCAIM line-based detector, the ETR cascade model, and a family of single-task and multi-task 3D U-Net variants.
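As a compact illustration of the preprocessing summarized above, the sketch below computes a Sobel-based gradient magnitude, thresholds it at the 3 °C/100 km lower bound, and derives raw day-to-day trend signs. It assumes a regular grid and a standard (unfilled) Sobel operator, and it omits the fixed min–max normalization and the run-length and manual-refinement rules of Algorithm 1 used for the actual datasets, so all constants and helper names are illustrative.

```python
import numpy as np
from scipy import ndimage

def gradient_magnitude(sst: np.ndarray, km_per_pixel: float) -> np.ndarray:
    """Sobel-based SST gradient magnitude in degC per 100 km (placeholder spacing)."""
    gx = ndimage.sobel(sst, axis=1) / (8.0 * km_per_pixel)  # approx. degC/km along x
    gy = ndimage.sobel(sst, axis=0) / (8.0 * km_per_pixel)  # approx. degC/km along y
    return np.hypot(gx, gy) * 100.0                         # degC / 100 km

def front_mask(grad: np.ndarray, lower_bound: float = 3.0) -> np.ndarray:
    """Candidate frontal pixels: gradient magnitude above the 3 degC/100 km bound."""
    return grad >= lower_bound

def daily_trend_signs(daily_intensity: np.ndarray) -> np.ndarray:
    """Raw binary trend from the day-to-day change of mean front intensity s(d):
    1 = strengthening (delta s(d) > 0), 0 = weakening.
    The released labels additionally apply run-length rules and manual refinement."""
    delta = np.diff(daily_intensity)
    return (delta > 0).astype(int)
```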
The single-task 3D U-Net achieves OFD performance close to strong 2D baselines while using fewer parameters and a similar daily compute budget. Its temporal encoding leads to more consistent front selection across consecutive days in multi-front scenes. In Figure 9, for example, DCENet sometimes treats a nearby nontarget front as the main structure and produces sudden large false positive regions between days. In contrast, 3D U-Net-S avoids such abrupt changes and instead moves the target mask more smoothly along the evolving frontal zone. This behavior can still misplace the boundaries between neighboring fronts on some dates. Even so, it more closely matches the way human analysts trace the same front over time rather than treating each day as an independent static scene. In settings that require detecting multiple ocean fronts simultaneously, different frontal zones can interact in complex ways. Temporal information may therefore play an even more important role in separating and tracking them over time.
Attaching an OFETR head to the shared backbone generally improves OFETR Accuracy and stabilizes training. The effect on OFD depends on where the head is attached. Deeper decoder attachments yield larger OFETR gains at the cost of small to moderate declines in OFD. Shallower attachments preserve OFD more faithfully but bring only milder OFETR improvements. A single shared backbone also simplifies implementation and reduces model size. Under the multi-task setting, the attachment position therefore governs the balance between tasks. Attachments at e1 and e2 largely preserve OFD but only modestly improve OFETR. Attachments at d2 and d3 tend to show stronger gradient interference that reduces OFETR and further suppresses OFD. Attachments at e4 and d1 offer a reasonable compromise on our datasets, and e4 in particular often improves OFETR with limited impact on OFD.
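The mechanism of attaching the trend head at a chosen stage can be sketched as follows. The backbone interface, pooling, and head shape are assumptions for illustration; the cls-block actually used in our model (Figure 7) differs in its details.

```python
import torch.nn as nn

class MultiTaskWrapper(nn.Module):
    """Share a segmentation backbone and attach a small OFETR head at one stage.

    backbone     : assumed to return the OFD logits and a dict of intermediate
                   feature maps keyed by stage name (illustrative interface).
    attach_stage : which feature map feeds the trend head (e.g., "e4" or "d1").
    """
    def __init__(self, backbone: nn.Module, attach_stage: str, feat_ch: int):
        super().__init__()
        self.backbone = backbone
        self.attach_stage = attach_stage
        self.trend_head = nn.Sequential(          # illustrative head, not the exact cls-block
            nn.AdaptiveAvgPool3d((None, 1, 1)),   # pool space, keep the time axis
            nn.Flatten(start_dim=2),              # (B, C, T')
            nn.Conv1d(feat_ch, 2, kernel_size=1), # per-day strengthening/weakening logits
        )

    def forward(self, x):
        seg_logits, feats = self.backbone(x)      # assumed backbone interface
        trend_logits = self.trend_head(feats[self.attach_stage])
        # Training would combine both outputs, e.g. a weighted sum of the
        # segmentation and trend losses (weighting scheme is an assumption).
        return seg_logits, trend_logits
```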
Ablation studies clarify how architectural decisions influence this balance. Clip length affects the two tasks in different ways. OFD prefers shorter clips that emphasize local spatiotemporal consistency and sharp edge contrast. OFETR prefers a medium clip that captures trends over several days without accumulating long-range drift. A clip length of six days provides a robust trade-off between accuracy and daily computation for both tasks.
The choice of 3D building block sets the compute–accuracy frontier. Factorized P3D blocks roughly halve the daily FLOPs and substantially reduce parameters compared with full 3D convolutions while only slightly lowering OFD metrics. For OFETR, the effect is mixed but often positive, especially at some attachment points where P3D matches or outperforms C3D despite the lower cost. This suggests that moderate factorization in time and space is an effective way to control complexity without sacrificing much accuracy.
We also examine how different input representations influence performance. On ZFCFM and ZFCFET, we compare single-channel grayscale gradients with two color-mapped inputs based on jet and viridis. Because we apply fixed min–max normalization and a constant mask threshold before any color mapping, these encodings behave as simple deterministic transforms from gradient magnitude to RGB. For OFD, the differences among grayscale, jet, and viridis are small at all attachment points and typically remain within a few tenths of a percentage point. This confirms that the network can recover boundary information reliably from either the scalar field or the color-mapped representation. For OFETR, the choice of encoding has a larger effect. Color-mapped inputs generally outperform grayscale by about one to two percentage points in Accuracy and F1, and viridis tends to be slightly better than jet in single-task settings, whereas jet retains a small advantage in some multi-task models. Overall, the experiments indicate that using a well-behaved color map can modestly benefit trend recognition, while the gap between different color maps is modest compared with the gap between color and grayscale. Future work could compare a broader range of color-mapping schemes for both OFD and OFETR. For example, two-stage mappings that reserve additional contrast for extreme values may help compensate for the loss of detail at the tails.
Finally, the cascade experiments highlight how OFD quality constrains OFETR when the two stages are decoupled. On our current datasets, the cascade 3D U-Net models achieve the highest OFETR accuracy and F1 in Table 1 because the underlying OFD masks are already reliable. At the same time, the morphological perturbation study in Figure 14 shows that small changes to the OFD masks can have a marked effect on trend prediction. Moderately dilating the masks to simulate extra false positives or eroding them to simulate missed detections leads to several percentage points of accuracy and F1 loss, especially for stronger perturbations. This pattern suggests that, in the cascade setting, the OFETR heads rely heavily on the binary OFD input and use the raw gradient fields only in a secondary way. As a result, they function more as post-processors of OFD than as independent detectors.
It is also important to note that our perturbations use idealized, spatially uniform dilation and erosion. Real OFD errors are often more heterogeneous and more tightly linked to the presence of multiple physical fronts in the same SST scene. In our setting, each ZFCFM or ZFCFET sample is annotated with a single target front, yet the underlying SST fields frequently contain several neighboring frontal zones. As illustrated in Figure 9h,i, DCENet can expand the mask so that it partly covers a nearby nontarget front while missing part of the annotated target band. In such a multi-front detection setting, a cascade OFETR model that depends strongly on OFD may inherit both large false positive regions on the nontarget front and large false negatives on the true front. Our simple morphological perturbations do not capture the full diversity of these structured errors. They still suggest that, when OFD conflates target and nontarget fronts in multi-front scenes, error propagation through the cascade can be more complex and potentially more severe than what is implied by uniform pixel-wise perturbations. These considerations further motivate continued exploration of multi-task learning approaches that reduce the explicit dependence of OFETR on binary OFD masks.

5. Conclusions

We employ a shared 3D U-Net to jointly learn OFD and point-level OFETR. Both tasks are trained against a common spatiotemporal representation, which alleviates error amplification in cascaded pipelines, reduces reliance on handcrafted compensation rules, and improves data efficiency. By supervising trend recognition at the point level, we preserve the full distribution of front intensity in both the loss and the predictions and avoid averaging alternating strengthening and weakening into a single window-level label.
On the ZFCFM and ZFCFET datasets, the single-task 3D U-Net delivers OFD performance close to strong 2D baselines while using far fewer parameters and a comparable daily compute budget. Its mIoU is within about 0.4 pp of U-Net++ and DCENet, yet it uses roughly one quarter of the parameters of U-Net++ and well below one fifth of DCENet, and its daily FLOPs lie between those two models. Multi-task learning further improves OFETR relative to single-task heads. For example, attaching OFETR at e4 raises Accuracy/F1 from roughly 84.2%/80.8% to about 86.0%/83.3%, and attachment at d1 gives smaller but consistent gains. The cascade configuration attains the highest OFETR Accuracy and F1 when high-quality OFD masks are available, but at the cost of stronger dependence on segmentation errors.
Ablation experiments clarify how design choices influence this balance. Clip length affects the two tasks differently. Shorter clips ( T = 4 ) slightly improve OFD, whereas medium clips ( T = 6 ) align better with OFETR and offer a good trade-off between accuracy and daily computation, so we adopt T = 6 as the default. Factorized P3D blocks trade a small mIoU drop for roughly half the parameters and daily FLOPs compared with full C3D blocks and often yield comparable or slightly better OFETR, which makes them a cost-effective option. Regarding input encodings, using jet or viridis color maps instead of a single-channel grayscale gradient brings only marginal changes to OFD. For OFETR, however, both color encodings provide a modest improvement of about 1–2 pp over grayscale, especially in single-task settings, and the performance gap between jet and viridis is generally within a few tenths of a percentage point. Finally, morphological perturbation experiments on the cascade models show that dilating or eroding the OFD masks can lead to substantial OFETR drops when a large fraction of frontal pixels is added or removed, whereas multi-task OFETR heads are less directly tied to the mask and are therefore less sensitive to such perturbations.
Despite these advantages, several limitations remain. First, although the single-task 3D U-Net produces temporally smoother and more conservative OFD predictions than 2D baselines, the impact of this behavior in fully multi-front detection scenarios, where several fronts are segmented simultaneously, remains to be verified. Second, the current OFETR formulation uses binary trend labels, even though ZFCFET is constructed by differencing ZFCFM-based daily intensities. Small mask errors, for example, those introduced by the filled Sobel operator, can shift the estimated daily change Δ s ( d ) across the decision boundary and induce label noise. In this setting, we consider direct regression less attractive than a refined classification that discretizes the magnitude of Δ s ( d ) into multiple levels, which may be more robust to moderate mask discrepancies. Third, our experiments focus on a single coastal front system at daily resolution, so cross-region, cross-resolution, and cross-season generalization has not yet been evaluated. Fourth, the present multi-task architecture is deliberately simple. We vary attachment positions to ease potential conflicts at deeper layers, but we do not explore more sophisticated mechanisms for mitigating negative transfer between OFD and OFETR. A noticeable performance gap also remains between multi-task OFETR and the optimistic cascade upper bound, so there is still scope to strengthen the trend branch without sacrificing segmentation.
Future work will extend OFETR beyond binary trends to multi-level labels based on the magnitude of Δ s ( d ) in order to capture finer-grained evolution. We also plan to move from single-front supervision to multi-target settings that explicitly handle several fronts within the same scene, where temporal cues may help disentangle interacting frontal zones. In addition, we aim to further optimize the multi-task framework itself, including the sharing scheme between the two tasks and strategies for combining the losses L OFD and L OFETR , so that OFD and OFETR can benefit more fully from each other while keeping error propagation under control.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/rs17233862/s1, Dataset S1: ZFCFM and ZFCFET datasets for daily ocean-front detection (OFD) and ocean-front evolution trend recognition (OFETR) over the Zhejiang–Fujian Coastal Front from 2002 to 2021. The archive contains front masks organized by year in the ZFCFM directory and yearly JSON files with point-level trend labels in the ZFCFET directory.

Author Contributions

Conceptualization, Q.H. and A.H.; methodology, A.H.; software, A.H.; validation, A.H.; formal analysis, A.H.; investigation, A.H.; resources, Q.H.; data curation, A.H.; writing—original draft preparation, A.H.; writing—review and editing, Q.H., A.H., W.Z., L.G. and Y.D.; visualization, A.H.; supervision, Q.H.; project administration, Q.H., W.Z., L.G. and Y.D.; funding acquisition, Q.H., W.Z., L.G. and Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 42376194) and by the Project of AI-driven Scientific Research for the Development of Disciplines by the Shanghai Education Commission.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The ESA SST CCI Level-4 Analysis product v3.0 can be accessed at https://dx.doi.org/10.5285/4a9654136a7148e39b7feb56f8bb02d2, accessed on 20 November 2025. The mesoscale front dataset constructed with CCAIM and used in this study is available at https://doi.org/10.5281/zenodo.14373832, accessed on 20 November 2025. The ZFCFM and ZFCFET datasets created in this study are provided as Supplementary Dataset S1 accompanying this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. McWilliams, J.C. Oceanic Frontogenesis. Annu. Rev. Mar. Sci. 2021, 13, 227–253.
2. Horner-Devine, A.R.; Hetland, R.D.; MacDonald, D.G. Mixing and Transport in Coastal River Plumes. Annu. Rev. Fluid Mech. 2015, 47, 569–594.
3. Walker, N.D.; Wiseman, W.J.; Rouse, L.; Babin, A. Effects of River Discharge, Wind Stress, and Slope Eddies on Circulation and the Satellite-Observed Structure of the Mississippi River Plume. J. Coast. Res. 2005, 21, 1228–1244.
4. Ou, H.W.; Dong, C.M.; Chen, D. Tidal Diffusivity: A Mechanism for Frontogenesis. J. Phys. Oceanogr. 2003, 33, 840–847.
5. Chen, C.; Xue, P.; Ding, P.; Beardsley, R.C.; Xu, Q.; Mao, X.; Gao, G.; Qi, J.; Li, C.; Lin, H.; et al. Physical Mechanisms for the Offshore Detachment of the Changjiang Diluted Water in the East China Sea. J. Geophys. Res. Ocean. 2008, 113, C02002.
6. Amos, C.M.; Castelao, R.M. Influence of the El Niño–Southern Oscillation on SST Fronts Along the West Coasts of North and South America. J. Geophys. Res. Ocean. 2022, 127, e2022JC018479.
7. O’Neill, L.W.; Chelton, D.B.; Esbensen, S.K. Observations of SST-Induced Perturbations of the Wind Stress Field over the Southern Ocean on Seasonal Timescales. J. Clim. 2003, 16, 2340–2354.
8. Liu, Y.; Meng, Z.; Chen, W.; Liang, Y.; Chen, W.; Chen, Y. Ocean Fronts and Their Acoustic Effects: A Review. J. Mar. Sci. Eng. 2022, 10, 2021.
9. Yao, Y.; Zhong, W.; He, H.; Sun, Y.; Feng, Z. The relationship between the North Pacific midlatitude oceanic frontal intensity and the storm track and its future changes. In Proceedings of the One Ocean Science Congress 2025, Nice, France, 3–6 June 2025. OOS2025-241.
10. Lévy, M.; Haëck, C.; Mangolte, I.; Cassianides, A.; El Hourany, R. Shift in phytoplankton community composition over fronts. Commun. Earth Environ. 2025, 6, 591.
11. Hsu, T.Y.; Chang, Y.; Lee, M.A.; Wu, R.F.; Hsiao, S.C. Predicting Skipjack Tuna Fishing Grounds in the Western and Central Pacific Ocean Based on High-Spatial-Temporal-Resolution Satellite Data. Remote Sens. 2021, 13, 861.
12. Coadou-Chaventon, S.; Speich, S.; Zhang, D.; Rocha, C.B.; Swart, S. Oceanic Fronts Driven by the Amazon Freshwater Plume and Their Thermohaline Compensation at the Submesoscale. J. Geophys. Res. Ocean. 2024, 129, e2024JC021326.
13. Zhu, R.; Yu, J.; Zhang, X.; Yang, H.; Ma, X. Air–Sea Interaction During Ocean Frontal Passage: A Case Study from the Northern South China Sea. Remote Sens. 2025, 17, 3024.
14. Cronin, M.F.; Zhang, D.; Wills, S.M.; Reeves Eyre, J.E.J.; Thompson, L.; Anderson, N. Diurnal warming rectification in the tropical Pacific linked to sea surface temperature front. Nat. Geosci. 2024, 17, 316–322.
15. Castelao, R.M.; Barth, J.A.; Mavor, T.P. Flow-topography interactions in the northern California Current System observed from geostationary satellite data. Geophys. Res. Lett. 2005, 32, L24612.
16. Castelao, R.M.; Wang, Y. Wind-Driven Variability in Sea Surface Temperature Front Distribution in the California Current System. J. Geophys. Res. Ocean. 2014, 119, 1861–1875.
17. Cayula, J.F.; Cornillon, P. Edge Detection Algorithm for SST Images. J. Atmos. Ocean. Technol. 1992, 9, 67–80.
18. Belkin, I.M.; O’Reilly, J.E. An algorithm for oceanic front detection in chlorophyll and SST satellite imagery. J. Mar. Syst. 2009, 78, 319–326.
19. Xing, Q.; Yu, H.; Wang, H.; Ito, S.i. An improved algorithm for detecting mesoscale ocean fronts from satellite observations: Detailed mapping of persistent fronts around the China Seas and their long-term trends. Remote Sens. Environ. 2023, 294, 113627.
20. Wang, Y.; Zhou, F.; Meng, Q.; Zhou, M.; Hu, Z.; Zhang, C.; Zhao, T. An ocean front detection and tracking algorithm. arXiv 2025, arXiv:2502.15250.
21. He, Q.; Gong, B.; Song, W.; Du, Y.; Zhao, D.; Zhang, W. DCENet: A Dense Contextual Ensemble Network for Multiclass Ocean Front Detection. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5.
22. Wang, Y.; Zhang, D.; Zhang, X. A New Method for Ocean Fronts’ Identification With Res-U-Net and Remotely Sensed Data in the Northwestern Pacific Area. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–11.
23. Wan, X.; Zhang, L.; Ma, X.; Xu, W.; Chen, Q.; Zhao, R.; Zeng, M. Dynamic gradient orientation and multi-scale fusion network for ocean front detection. J. Sea Res. 2025, 206, 102601.
24. Xing, Q.; Yu, H.; Wang, H. Global mapping and evolution of persistent fronts in Large Marine Ecosystems over the past 40 years. Nat. Commun. 2024, 15, 4090.
25. Yang, K.; Meyer, A.; Strutton, P.G.; Fischer, A.M. Global trends of fronts and chlorophyll in a warming ocean. Commun. Earth Environ. 2023, 4, 489.
26. Yang, Y.; Lam, K.M.; Sun, X.; Dong, J.; Lguensat, R. An Efficient Algorithm for Ocean-Front Evolution Trend Recognition. Remote Sens. 2022, 14, 259.
27. Embury, O.; Merchant, C.J.; Good, S.A.; Rayner, N.A.; Høyer, J.L.; Atkinson, C.; Block, T.; Alerskans, E.; Pearson, K.J.; Worsfold, M.; et al. Satellite-based time-series of sea-surface temperature since 1980 for climate applications. Sci. Data 2024, 11, 326.
28. Cao, W.; Xie, C.; Han, B.; Dong, J. Automatic Fine Recognition of Ocean Front Fused with Deep Learning. Comput. Eng. 2020, 46, 266–274.
29. Zhang, Y.; Yang, Q. A Survey on Multi-Task Learning. IEEE Trans. Knowl. Data Eng. 2022, 34, 5586–5609.
30. Rafiq, G.; Rafiq, M.; Choi, G.S. Video Description: A Comprehensive Survey of Deep Learning Approaches. Artif. Intell. Rev. 2023, 56, 13293–13372.
31. Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2016, Athens, Greece, 17–21 October 2016; Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2016; Volume 9901, pp. 424–432.
32. Xie, B.; Qi, J.; Yang, S.; Sun, G.; Feng, Z.; Yin, B.; Wang, W. Sea Surface Temperature and Marine Heat Wave Predictions in the South China Sea: A 3D U-Net Deep Learning Model Integrating Multi-Source Data. Atmosphere 2024, 15, 86.
33. Hu, B.; Gao, B.; Woo, W.L.; Ruan, L.; Jin, J.; Yang, Y.; Yu, Y. A Lightweight Spatial and Temporal Multi-Feature Fusion Network for Defect Detection. IEEE Trans. Image Process. 2021, 30, 472–486.
34. Qiu, Z.; Yao, T.; Mei, T. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5534–5542.
35. Xing, Q.; Yu, H.; Yu, W.; Chen, X.; Wang, H. A global daily mesoscale front dataset from satellite observations: In situ validation and cross-dataset comparison. Earth Syst. Sci. Data 2025, 17, 2831–2848.
36. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
37. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2015; Volume 9351, pp. 234–241.
38. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Granada, Spain, 20 September 2018; Stoyanov, D., Taylor, Z., Carneiro, G., Syeda-Mahmood, T., Martel, A., Maier-Hein, L., Tavares, J.M.R., Bradley, A., Papa, J.P., Belagiannis, V., et al., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; Volume 11045, pp. 3–11.
39. Crameri, F.; Shephard, G.E.; Heron, P.J. The misuse of colour in science communication. Nat. Commun. 2020, 11, 5444.
Figure 1. Panels (af) present a short-term example of ocean-front evolution trend over the Zhejiang–Fujian coastal region during 20 to 25 January 2002. The subtitle below each panel indicates the daily strength of the Zhejiang–Fujian Coastal Front, which is defined as the mean sea surface temperature (SST) gradient with units °C/100 km. In our definition of the evolution trend, from (ad) the front exhibits a strengthening trend, and from (df) it exhibits a weakening trend. Panels are ordered chronologically within the stated period.
Figure 2. Empirical cumulative distribution functions (ECDF) for the Zhejiang–Fujian coastal region during 2002–2021. (a) ECDF of gradient magnitude over all valid ocean pixels with the vertical axis transformed by the 10th power. (b) ECDF of gradient magnitude restricted to pixels inside the manually annotated front masks with the vertical axis transformed by the 5th power. (c) ECDF of the absolute day-to-day change | Δ s ( d ) | .
Figure 3. Monthly statistics of the Zhejiang–Fujian Coastal Front for 2002–2021 ( n = 20 years). (a) Coverage from the Zhejiang–Fujian Coastal Front Mask (ZFCFM) dataset, computed as the fraction of ocean-only pixels. (b) Strengthening ratio from the Zhejiang–Fujian Coastal Front Evolutionary Trend (ZFCFET) dataset. Solid lines show cross-year means of year–month values. Shaded bands indicate 95 % confidence intervals across years based on a t distribution and reflect interannual variability.
Figure 4. Fraction of days that require manual confirmation after Algorithm 1 under different choices of the minimum run length L min and the absolute threshold τ .
Figure 5. Example of manual refinement of trend labels for a short unresolved segment in the ZFCFET dataset. Panels (ag) display daily SST gradient heatmaps for 27 October–2 November 2004, with ZFCFM front masks outlined in white. Algorithm 1 initially assigns weakening labels to (b,c) and strengthening labels to (f,g), while (d,e) remain unresolved. Based on the visual criteria described in the text, (c) is identified as the turning day and (d,e) are subsequently labeled as strengthening.
Figure 6. Segment-wise trends of monthly Zhejiang–Fujian Coastal Front intensity split at a Pettitt change point around February 2012 (K = 6242, p = 7.65 × 10⁻⁹). (a) 2002–2011: Theil–Sen slope −0.81 °C/100 km per decade, with a relative change of −1.42% per decade, not significant (Mann–Kendall p = 0.067). (b) 2012–2021: Theil–Sen slope +0.194 °C/100 km per decade, with a relative change of +3.45% per decade, significant (Mann–Kendall p = 3.05 × 10⁻⁴).
Figure 7. Overall network architecture. The OFETR head is attached at the first decoder. (ad) illustrate the Pseudo-3D (P3D) block, the OFETR head that outputs point-level results, the encoder, and the decoder.
Figure 8. Validation performance comparison across different models. Panels (ac) show OFD with mIoU at attachment positions e3, e4, and d1, respectively. Panels (df) show OFETR with AUROC at the same positions. In each panel, the blue line denotes a strong single-task reference model (DCENet for OFD; single-task 3D U-Net-S (d2) for OFETR), the orange line denotes the single-task 3D U-Net, and the green line denotes the multi-task 3D U-Net. Shaded bands indicate 95% confidence intervals ( n = 5 ).
Figure 9. Six-day visual comparison from 4 to 9 April 2018. Panels (af) show the reference gradient-magnitude heatmaps. Panels (gl) present DCENet predictions, and panels (mr) present 3D U-Net-S predictions for the same dates in the same left-to-right order. Subfigure titles in the prediction rows report the per-day IoU in percent. Color coding for prediction overlays is as follows: black denotes the land mask, green denotes true positives where prediction matches the label, orange denotes false positives predicted by the model, blue denotes regions that are positive in the ground-truth label but predicted negative by the model, and white denotes background. Columns are aligned by date across rows.
Figure 10. Six-day feature visualization from 1 to 6 January 2018. Panels (af) show the cls-block input feature-norm maps for the single-task 3D U-Net-S (e4) model, and panels (gl) show the corresponding maps for the multi-task variant with the OFETR head attached at the e4 layer.
Figure 11. Monthly performance heatmaps. Panels (ad) show monthly mean values for (a) OFD mIoU, (b) OFD mF1, (c) OFETR accuracy, and (d) OFETR F1 across all evaluated models. Each cell corresponds to one combination of model and month. Higher values are shown with lighter colors.
Figure 12. Validation performance comparison across different clip lengths for multi-task 3D U-Net. Panels (ac) show OFD mIoU at attachment points e3, e4, and d1. Panels (df) show OFETR AUROC at the same attachment points. In each panel, the blue line denotes T = 4 , the orange line denotes T = 6 , and the green line denotes T = 8 . Shaded bands indicate 95% confidence intervals ( n = 5 ).
Figure 13. Validation performance comparison between P3D and C3D blocks. Panels (ac) show OFD with mIoU at attachment positions e3, e4, and d1. Panels (df) show OFETR with AUROC at the same positions. In each panel, the blue curve denotes P3D and the orange curve denotes C3D. Shaded bands indicate 95% confidence intervals ( n = 5 ).
Figure 14. Effect of morphological perturbations on cascade OFETR performance. Each panel shows mean Accuracy or F1 across five random seeds for the cascade 3D U-Net model, where the OFETR head is attached at different positions (e1–e4 for encoder stages and d1–d3 for decoder stages). Panels (a,b) report the impact of dilation with kernel sizes k { 3 , 5 , 7 , 9 } , while panels (c,d) report the impact of erosion with k { 2 , 3 , 4 , 5 } . In all panels, the horizontal black dashed line denotes the cascade baseline that uses the unmodified OFD masks predicted by the single-task 3D U-Net, and each marker corresponds to the mean performance at a given k for a fixed attachment point.
Table 1. Model performance comparison. For 3D U-Net variants, the suffixes -S, -M, and -C denote the single-task, multi-task, and cascade settings, respectively. Other models are trained only for the corresponding task. Attachment-point notation: parentheses use e for encoder and d for decoder, and the trailing number indicates the stage index (e.g., e3 attaches after the third encoder). Floating-point operations (FLOPs) are reported as FLOPs(G) per single forward pass. Bold indicates the best result in Columns 2–5.
Model | OFD mIoU | OFD mF1 | OFETR Acc | OFETR F1 | Params (M) | FLOPs (G)
CCAIM | 46.84 | 62.54 | – | – | – | –
U-Net | 92.84 ± 0.11 | 96.13 ± 0.06 | – | – | 31.04 | 18.84
U-Net++ | 92.78 ± 0.19 | 96.11 ± 0.11 | – | – | 36.63 | 53.97
DCENet | 92.93 ± 0.04 | 96.18 ± 0.02 | – | – | 65.65 | 77.69
Res-U-Net | 92.11 ± 0.32 | 95.73 ± 0.18 | – | – | 8.04 | 15.17
3D U-Net-S | 92.55 ± 0.27 | 95.99 ± 0.16 | – | – | 9.08 | 159.64
ETR | – | – | 69.04 ± 1.54 | 59.43 ± 4.05 | 21.79 | 6.40
3D U-Net-S(e1) | – | – | 82.07 ± 0.49 | 78.44 ± 0.71 | 0.07 | 11.45
3D U-Net-S(e2) | – | – | 82.88 ± 0.57 | 79.31 ± 0.94 | 0.36 | 18.43
3D U-Net-S(e3) | – | – | 84.09 ± 0.40 | 80.36 ± 0.83 | 1.52 | 25.39
3D U-Net-S(e4) | – | – | 84.19 ± 0.69 | 80.83 ± 1.02 | 6.13 | 32.32
3D U-Net-S(d1) | – | – | 85.15 ± 0.15 | 82.16 ± 0.52 | 8.40 | 63.22
3D U-Net-S(d2) | – | – | 85.68 ± 0.84 | 82.92 ± 1.45 | 8.97 | 94.19
3D U-Net-S(d3) | – | – | 84.82 ± 0.75 | 82.00 ± 0.75 | 9.12 | 125.3
3D U-Net-M(e1) | 92.59 ± 0.13 | 96.02 ± 0.08 | 82.01 ± 0.56 | 78.43 ± 0.87 | 9.12 | 125.31
3D U-Net-M(e2) | 92.35 ± 0.11 | 95.88 ± 0.07 | 83.45 ± 0.64 | 80.16 ± 1.00 | 9.22 | 125.27
3D U-Net-M(e3) | 91.98 ± 0.40 | 95.68 ± 0.23 | 85.35 ± 0.53 | 82.26 ± 0.89 | 9.66 | 125.26
3D U-Net-M(e4) | 91.98 ± 0.29 | 95.67 ± 0.16 | 85.99 ± 0.31 | 83.32 ± 0.41 | 11.38 | 125.25
3D U-Net-M(d1) | 91.87 ± 0.18 | 95.62 ± 0.10 | 85.40 ± 0.76 | 82.55 ± 1.50 | 9.66 | 125.26
3D U-Net-M(d2) | 91.49 ± 0.54 | 95.39 ± 0.31 | 84.09 ± 0.80 | 81.09 ± 1.40 | 9.22 | 125.27
3D U-Net-M(d3) | 91.08 ± 0.88 | 95.18 ± 0.49 | 84.20 ± 1.09 | 81.08 ± 1.35 | 9.12 | 125.31
3D U-Net-C(e1) | – | – | 87.60 ± 0.46 | 85.05 ± 0.71 | 9.15 | 131.23
3D U-Net-C(e2) | – | – | 88.37 ± 0.26 | 86.16 ± 0.38 | 9.44 | 138.21
3D U-Net-C(e3) | – | – | 87.94 ± 0.58 | 85.45 ± 0.65 | 10.60 | 145.16
3D U-Net-C(e4) | – | – | 87.58 ± 0.79 | 85.28 ± 0.89 | 15.21 | 152.10
3D U-Net-C(d1) | – | – | 87.79 ± 0.59 | 85.55 ± 0.72 | 17.48 | 183.00
3D U-Net-C(d2) | – | – | 87.48 ± 0.92 | 85.19 ± 0.91 | 18.05 | 213.97
3D U-Net-C(d3) | – | – | 86.67 ± 1.20 | 84.21 ± 1.84 | 18.20 | 245.08
Table 2. Clip-length comparison for single-task and multi-task 3D U-Net variants. For each model configuration, the best value across clip lengths T { 4 , 6 , 8 } is underlined. Task-wise best scores in Columns 3–6 are set in bold. Daily FLOPs gives the computational cost (G/day) required to produce contiguous point-level outputs.
Model | Clip Length | OFD mIoU | OFD mF1 | OFETR Accuracy | OFETR F1 | Daily FLOPs (G/day)
3D U-Net-S | 4 | 92.80 ± 0.09 | 96.13 ± 0.05 | – | – | 79.82
3D U-Net-S | 6 | 92.55 ± 0.27 | 95.99 ± 0.16 | – | – | 59.87
3D U-Net-S | 8 | 92.48 ± 0.14 | 95.96 ± 0.08 | – | – | 53.21
3D U-Net-S (e3) | 4 | – | – | 84.52 ± 0.41 | 81.37 ± 0.53 | 16.92
3D U-Net-S (e3) | 6 | – | – | 84.09 ± 0.40 | 80.36 ± 0.83 | 12.70
3D U-Net-S (e3) | 8 | – | – | 83.06 ± 0.37 | 79.30 ± 0.23 | 11.28
3D U-Net-S (e4) | 4 | – | – | 84.46 ± 0.48 | 81.33 ± 0.80 | 26.38
3D U-Net-S (e4) | 6 | – | – | 84.19 ± 0.69 | 80.83 ± 1.02 | 16.16
3D U-Net-S (e4) | 8 | – | – | 83.72 ± 0.47 | 80.11 ± 1.43 | 14.37
3D U-Net-S (d1) | 4 | – | – | 84.80 ± 0.55 | 82.06 ± 0.89 | 42.15
3D U-Net-S (d1) | 6 | – | – | 85.15 ± 0.15 | 82.16 ± 0.52 | 31.61
3D U-Net-S (d1) | 8 | – | – | 84.29 ± 0.70 | 81.37 ± 0.95 | 28.10
3D U-Net-M (e3) | 4 | 92.26 ± 0.25 | 95.84 ± 0.14 | 85.30 ± 0.78 | 82.55 ± 0.75 | 83.50
3D U-Net-M (e3) | 6 | 91.98 ± 0.40 | 95.68 ± 0.23 | 85.35 ± 0.53 | 82.26 ± 0.89 | 62.63
3D U-Net-M (e3) | 8 | 91.64 ± 0.57 | 95.49 ± 0.33 | 84.70 ± 0.37 | 81.32 ± 0.39 | 55.67
3D U-Net-M (e4) | 4 | 92.12 ± 0.18 | 95.74 ± 0.11 | 84.82 ± 1.05 | 82.26 ± 1.00 | 83.50
3D U-Net-M (e4) | 6 | 91.98 ± 0.29 | 95.67 ± 0.16 | 85.99 ± 0.31 | 83.32 ± 0.41 | 62.63
3D U-Net-M (e4) | 8 | 91.85 ± 0.19 | 95.60 ± 0.10 | 85.83 ± 0.53 | 83.27 ± 0.74 | 55.66
3D U-Net-M (d1) | 4 | 91.42 ± 0.94 | 95.35 ± 0.54 | 85.40 ± 1.25 | 82.79 ± 1.72 | 83.50
3D U-Net-M (d1) | 6 | 91.87 ± 0.18 | 95.62 ± 0.10 | 85.40 ± 0.76 | 82.55 ± 1.50 | 62.63
3D U-Net-M (d1) | 8 | 91.81 ± 0.28 | 95.59 ± 0.15 | 85.45 ± 0.17 | 83.07 ± 0.19 | 55.67
Table 3. Comparison between 3D convolution (C3D) and P3D building blocks in single-task and multi-task 3D U-Net variants (clip length T = 6 ). Within each model, the better-performing block is underlined. Task-wise best scores across all configurations in Columns 3–6 are highlighted in bold. Params denotes the number of trainable parameters, and Daily FLOPs gives the computational cost (G/day) required to produce contiguous point-level outputs.
Model | Block | OFD mIoU | OFD mF1 | OFETR Accuracy | OFETR F1 | Params (M) | Daily FLOPs (G/day)
3D U-Net-S | C3D | 92.88 ± 0.08 | 96.18 ± 0.05 | – | – | 17.70 | 129.42
3D U-Net-S | P3D | 92.55 ± 0.27 | 95.99 ± 0.16 | – | – | 9.08 | 59.87
3D U-Net-S (e3) | C3D | – | – | 85.00 ± 0.44 | 81.80 ± 0.93 | 2.29 | 20.01
3D U-Net-S (e3) | P3D | – | – | 84.09 ± 0.40 | 80.36 ± 0.83 | 1.52 | 12.70
3D U-Net-S (e4) | C3D | – | – | 84.49 ± 0.43 | 81.11 ± 1.02 | 9.32 | 21.55
3D U-Net-S (e4) | P3D | – | – | 84.19 ± 0.69 | 80.83 ± 1.02 | 6.13 | 16.16
3D U-Net-S (d1) | C3D | – | – | 85.61 ± 0.91 | 82.76 ± 1.21 | 15.73 | 61.63
3D U-Net-S (d1) | P3D | – | – | 85.15 ± 0.15 | 82.16 ± 0.52 | 8.40 | 31.61
3D U-Net-M (e3) | C3D | 91.86 ± 0.55 | 95.63 ± 0.31 | 85.15 ± 0.47 | 82.24 ± 0.57 | 18.27 | 132.18
3D U-Net-M (e3) | P3D | 91.98 ± 0.40 | 95.68 ± 0.23 | 85.35 ± 0.53 | 82.26 ± 0.89 | 9.66 | 62.63
3D U-Net-M (e4) | C3D | 91.99 ± 0.32 | 95.69 ± 0.18 | 85.11 ± 0.48 | 82.26 ± 0.67 | 19.99 | 132.17
3D U-Net-M (e4) | P3D | 91.98 ± 0.29 | 95.67 ± 0.16 | 85.99 ± 0.31 | 83.32 ± 0.41 | 11.38 | 62.63
3D U-Net-M (d1) | C3D | 92.02 ± 0.31 | 95.70 ± 0.17 | 85.17 ± 0.69 | 82.38 ± 0.73 | 18.27 | 132.18
3D U-Net-M (d1) | P3D | 91.87 ± 0.18 | 95.62 ± 0.10 | 85.40 ± 0.76 | 82.55 ± 1.50 | 9.66 | 62.63
Table 4. Input comparison for single-task and multi-task 3D U-Net variants. Inputs are either single-channel gradient magnitude (Gray) or three-channel color-mapped gradients (Jet and Viridis). For each model, the best value across input modes is underlined. Task-wise best scores in Columns 3–6 are set in bold.
Model | Input | OFD mIoU | OFD mF1 | OFETR Accuracy | OFETR F1
3D U-Net-S | Gray | 92.67 ± 0.15 | 96.06 ± 0.09 | – | –
3D U-Net-S | Jet | 92.55 ± 0.27 | 95.99 ± 0.16 | – | –
3D U-Net-S | Viridis | 92.53 ± 0.15 | 95.99 ± 0.08 | – | –
3D U-Net-S (e3) | Gray | – | – | 83.43 ± 0.54 | 80.02 ± 0.95
3D U-Net-S (e3) | Jet | – | – | 84.09 ± 0.40 | 80.36 ± 0.83
3D U-Net-S (e3) | Viridis | – | – | 84.54 ± 0.65 | 80.79 ± 1.29
3D U-Net-S (e4) | Gray | – | – | 82.56 ± 1.31 | 78.59 ± 1.59
3D U-Net-S (e4) | Jet | – | – | 84.19 ± 0.69 | 80.83 ± 1.02
3D U-Net-S (e4) | Viridis | – | – | 84.44 ± 0.54 | 81.65 ± 0.70
3D U-Net-S (d1) | Gray | – | – | 83.02 ± 0.97 | 79.47 ± 1.99
3D U-Net-S (d1) | Jet | – | – | 85.15 ± 0.15 | 82.16 ± 0.52
3D U-Net-S (d1) | Viridis | – | – | 85.10 ± 0.75 | 82.51 ± 0.86
3D U-Net-M (e3) | Gray | 91.38 ± 0.64 | 95.35 ± 0.35 | 84.16 ± 0.55 | 80.51 ± 1.27
3D U-Net-M (e3) | Jet | 91.98 ± 0.40 | 95.68 ± 0.23 | 85.35 ± 0.53 | 82.26 ± 0.89
3D U-Net-M (e3) | Viridis | 91.74 ± 0.51 | 95.55 ± 0.30 | 85.34 ± 0.61 | 82.61 ± 0.44
3D U-Net-M (e4) | Gray | 92.00 ± 0.20 | 95.68 ± 0.11 | 84.34 ± 0.26 | 81.53 ± 0.41
3D U-Net-M (e4) | Jet | 91.98 ± 0.29 | 95.67 ± 0.16 | 85.99 ± 0.31 | 83.32 ± 0.41
3D U-Net-M (e4) | Viridis | 91.92 ± 0.26 | 95.64 ± 0.14 | 85.39 ± 0.43 | 82.85 ± 0.68
3D U-Net-M (d1) | Gray | 92.21 ± 0.17 | 95.81 ± 0.09 | 84.35 ± 0.74 | 81.79 ± 0.78
3D U-Net-M (d1) | Jet | 91.87 ± 0.18 | 95.62 ± 0.10 | 85.40 ± 0.76 | 82.55 ± 1.50
3D U-Net-M (d1) | Viridis | 91.47 ± 0.55 | 95.39 ± 0.30 | 84.71 ± 0.57 | 82.24 ± 0.74