1. Introduction
As typical nonlinear dynamic systems, marine ecosystems are regulated by the synergistic effects of multiple environmental factors (including physical, chemical, and biological processes) and exhibit highly complex spatio-temporal heterogeneity [1,2]. With increasing human activities, environmental issues such as offshore eutrophication and water pollution have continued to worsen. Consequently, high-precision and large-scale prediction of key ocean elements (e.g., chlorophyll-a concentration, sea surface temperature) has become critical for marine ecological management and disaster early warning [3,4].
Current ocean remote sensing prediction methods primarily follow two paradigms: physics-driven and data-driven approaches [5,6]. Physics-driven methods, such as numerical models (e.g., ROMS, FVCOM), simulate ocean parameter evolution by solving simplified fluid dynamics equations derived from the Navier–Stokes equations [7]. While these models offer high interpretability, they face limitations including complex multi-physics field coupling, parameterization errors, and substantial computational costs, which restrict their practical application. In contrast, data-driven approaches have gained prominence with advances in artificial intelligence and the growing availability of ocean remote sensing data [8,9]. Researchers have successfully employed various machine learning models for ocean element prediction, including random forest (RF) [10], support vector machine (SVM) [11], and artificial neural network (ANN) [12,13]. Among these, convolutional neural networks (CNNs) excel at capturing spatial correlations in gridded data, leveraging convolutional operations to extract multi-scale features from ocean element images [14], while long short-term memory (LSTM) models demonstrate superior capability in modeling temporal dependencies within sequential data [15,16]. To address the need for joint spatio-temporal modeling, hybrid approaches such as CNN-LSTM and ConvLSTM have been developed. CNN-LSTM employs a serial architecture to separately handle spatial and temporal dynamics, though it may suffer from information loss between modules [17,18], whereas ConvLSTM integrates convolutional operations into LSTM gating units to simultaneously resolve spatio-temporal interactions, albeit with increased computational complexity [19,20]. Recent advancements, such as self-attention mechanisms (SAM), have further enhanced feature representation by capturing long-range dependencies, though their black-box nature and sensitivity to data quality remain challenges [21,22]. Despite these advancements, data-driven models remain limited by their inability to fully explain complex ocean element interactions, which ultimately affects the reliability of the prediction results.
To address the limitations of existing prediction methods, researchers have begun to explore novel modeling approaches that integrate prior knowledge to enhance prediction accuracy while increasing model interpretability [23]. Among these, knowledge graphs (KGs) have gained increasing attention due to their structured representations and symbolic reasoning capabilities [24,25]. KGs store vast amounts of factual knowledge in triplets (head entity, relation, tail entity), effectively establishing domain-specific knowledge systems through interlinked entities and relations. In geospatial information science, scholars have developed specialized geographic knowledge graphs, such as GeoKGs [26,27], GIS KGs [28], flood KGs [29], UrbanKG [30], and RSKG [31]. These knowledge graphs not only model domain-specific knowledge but also support downstream applications by providing structured and interpretable representations. For instance, GeoKGs offer a novel framework for understanding, representing, and mining geoscientific knowledge through the integration of Earth big data, geoscientific knowledge, and models [26]. UrbanKG demonstrates remarkable performance in urban functional identification by integrating multi-source urban data [30,32,33]. Furthermore, the context-aware knowledge graph (CKG)-based traffic flow prediction model significantly enhances prediction accuracy by capturing the complex relationships of urban spatial and temporal contexts [34].
However, despite significant advancements in the implementation of KGs across various domains, integrating domain knowledge into Earth observation of ocean elements confronts distinct technical challenges. Firstly, in contrast to geographic features, ocean elements manifest intricate nonlinear spatial and temporal evolution patterns [1,2]. This necessitates ocean KGs capable of capturing the spatial and temporal dependencies of ocean elements, representing their interaction patterns, and supporting dynamic updates. Secondly, the integration of marine domain knowledge with remote sensing data presents significant technical barriers in multimodal feature learning. Current KG-based approaches generally encounter challenges in extracting semantic features from marine knowledge, primarily due to the absence of standardized ontologies and the intricacy of marine ecosystem relationships [35,36]. Furthermore, aligning and fusing these semantic features with high-dimensional visual features from time-series remote sensing imagery is difficult and frequently leads to information loss or feature misalignment [31]. Thirdly, the integration of fused features into conventional prediction networks is constrained by architectural limitations. The majority of existing spatio-temporal prediction models have not been designed to accommodate graph-structured knowledge, leading to suboptimal integration of domain knowledge [37]. This necessitates innovative network architectures capable of effectively combining structured knowledge representations with conventional spatio-temporal data models. To our knowledge, no prior study has applied knowledge graph techniques to the prediction of typical ocean elements (e.g., chlorophyll-a concentrations).
To address the aforementioned challenges, this study proposes a domain knowledge-guided remote sensing prediction framework for ocean elements (OKG-ConvGRU). The framework consists of four core modules: the ocean elements spatio-temporal knowledge graph (OKG), semantic representation of the knowledge graph, a cross-attention-based multimodal feature fusion module (CAFM), and spatio-temporal prediction with an enhanced ConvGRU network. Specifically, the OKG is first constructed based on the domain knowledge of ocean elements, followed by semantic embedding representations for its spatial and temporal dimensions. Subsequently, CAFM is designed to deeply integrate the semantic features of the OKG with time-series remote sensing image features. Finally, these fused features are integrated into the enhanced ConvGRU network. For long-term prediction, a strategy combining the Seq2Seq architecture with multi-step rolling prediction is adopted to enhance prediction stability. Compared to traditional knowledge-guided methods, OKG-ConvGRU offers several advantages in marine element prediction. Firstly, it employs structured modeling of the geographic spatial distribution, temporal variations, and influencing mechanisms of marine elements, which can represent the complex interactions among physical, chemical, and biological factors and thereby improves the explainability of the model. Secondly, the spatio-temporal knowledge of marine elements and time-series remote sensing data are fused by the cross-attention-based feature fusion module (CAFM), and the fused spatio-temporal features are then learned by the enhanced ConvGRU network, which improves both the prediction accuracy of marine elements and the data utilization efficiency. Thirdly, by integrating the Seq2Seq architecture with multi-step rolling prediction, OKG-ConvGRU improves the stability of long-term forecasting. The contributions of this study can be summarized in the following three aspects:
- (1) A spatio-temporal knowledge graph of ocean elements (OKG) is constructed, effectively representing the geospatial distribution characteristics, temporal change patterns, and influence mechanisms of key ocean elements in a structured manner.
- (2) A domain knowledge-guided remote sensing prediction framework for ocean elements is proposed, which combines the knowledge graph with the ConvGRU network for the first time. Based on the cross-attention mechanism, CAFM effectively fuses the spatio-temporal semantic features in OKG and the visual features of time-series remote sensing images, thereby enhancing the model’s prediction performance for ocean elements.
- (3) The performance of the OKG-ConvGRU-based chlorophyll-a concentration prediction model is evaluated using the eastern seas of China (Bohai Sea, Yellow Sea, and East China Sea) as an experimental area. The experimental results show that, compared with the baseline model, the proposed model exhibits significant advantages in prediction accuracy, long-term prediction stability, data utilization efficiency, and robustness.
3. Methodology
3.1. Overview
For each type of marine environmental element in the study area, its remote sensing time-series image data can be represented as $\{X_t \mid t = 1, 2, \cdots, n\}$, where $X_t = (x_t^1, x_t^2, \cdots, x_t^m)$, and $x_t^i$ $(i = 1, 2, \cdots, m)$ denotes the observed value at the i-th spatial location at time t. Here, n represents the length of the time series, and m represents the total number of pixels in a single image. The purpose of this study is to use the sequential image data of the target element and its influencing elements over the past T time steps $\{X_{t-T+1}, \cdots, X_{t-1}, X_t\}$ to predict the images of the target element for the next k time steps $\{X_{t+1}, \cdots, X_{t+k-1}, X_{t+k}\}$ by learning their spatio-temporal evolution patterns.
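The forecasting task stated above (past T images in, next k images out) can be illustrated with a sliding-window sample builder. This is only a minimal sketch: the function name `make_windows`, the flattened `(n, m)` array layout, and the toy values are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def make_windows(series, T, k):
    """Split an (n, m) image time series into (input, target) pairs.

    series: one flattened image of m pixels per time step (assumed layout).
    Returns inputs of shape (N, T, m) and targets of shape (N, k, m),
    where N = n - T - k + 1.
    """
    series = np.asarray(series)
    n = series.shape[0]
    if n < T + k:
        raise ValueError("series is shorter than T + k time steps")
    inputs, targets = [], []
    for start in range(n - T - k + 1):
        inputs.append(series[start:start + T])           # past T steps
        targets.append(series[start + T:start + T + k])  # next k steps
    return np.stack(inputs), np.stack(targets)

# Toy example: n = 10 time steps, m = 4 pixels per image.
toy = np.arange(40, dtype=float).reshape(10, 4)
X, Y = make_windows(toy, T=6, k=2)
print(X.shape, Y.shape)
```

Each target window starts immediately after its input window, so no future information leaks into the inputs.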
To address the above problems, this paper proposes a domain knowledge-guided remote sensing prediction framework for ocean elements, named OKG-ConvGRU, as shown in Figure 2. Firstly, we construct an ocean elements spatio-temporal knowledge graph (OKG) and then perform semantic embedding representations of its spatial and temporal dimensions. Subsequently, we design a cross-attention-based feature fusion module (CAFM) to effectively fuse spatio-temporal multimodal features. After that, the fused features are integrated into an enhanced ConvGRU network. Finally, the spatio-temporal multi-step prediction of ocean elements is achieved based on the OKG-ConvGRU framework.
3.2. Construction of Spatio-Temporal Knowledge Graph for Ocean Elements
When processing remote sensing data of ocean elements, considering their close association with spatio-temporal characteristics, we categorize the relationships in the knowledge graph into spatial relationships (entity–relationship–entity), temporal relationships (entity–temporal relationship–time), and attribute relationships (entity–attribute–attribute value). This categorization facilitates comprehensive modeling of the spatial distribution characteristics, temporal variation patterns, and influencing mechanisms of ocean elements from multiple dimensions.
Existing spatio-temporal knowledge graphs mostly focus on urban areas with complex feature types, making them difficult to apply directly to marine scenarios because the elemental characteristics of urban and ocean environments differ significantly. Therefore, we construct an ocean elements spatio-temporal knowledge graph (OKG) containing a total of 146 triplets, as shown in Figure 3. A list of all triplets is given in Table A1 in Appendix A. This graph encompasses both spatial and temporal dimensions, structurally representing the spatial distribution and temporal variation patterns of key ocean elements (Chl-a, SST, PIC, POC, PAR, NFLH) and related environmental factors within the study area. This provides a foundation for further integration of domain knowledge into the prediction model.
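To make the triplet structure concrete, the sketch below stores a few triplets as plain tuples and derives the integer identifiers that an embedding model would need. It is illustrative only: the first two triplets follow the examples quoted in Section 3.2.1, the third is a hypothetical instance of the (estuary, inflow, sea area) pattern, and the relation spellings and helper names are assumptions.

```python
from collections import defaultdict

# (head entity, relation, tail entity)
triples = [
    ("Bohai Sea", "latitude_range", "37°23′N−41°23′N"),
    ("Yellow River Estuary", "latitude", "37°24′N"),
    ("Yellow River Estuary", "inflow", "Bohai Sea"),  # hypothetical example
]

def index_by_head(triples):
    """Group (relation, tail) pairs under each head entity."""
    by_head = defaultdict(list)
    for head, relation, tail in triples:
        by_head[head].append((relation, tail))
    return by_head

def build_vocab(triples):
    """Assign integer ids to entities and relations for embedding lookup."""
    entities = sorted({e for h, _, t in triples for e in (h, t)})
    relations = sorted({r for _, r, _ in triples})
    ent2id = {e: i for i, e in enumerate(entities)}
    rel2id = {r: i for i, r in enumerate(relations)}
    return ent2id, rel2id

by_head = index_by_head(triples)
ent2id, rel2id = build_vocab(triples)
print(len(ent2id), "entities,", len(rel2id), "relations")
```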
3.2.1. Construction of Knowledge Graph in Spatial Dimension
The spatial dimension of OKG is designed to reveal the spatial distribution patterns of ocean, land, river inlets, and key ocean elements in the remote sensing images of the eastern seas of China, as well as their mutual influence mechanisms; it contains a total of 93 triplets (Figure 3a). The specific components are as follows:
- (1) Spatial distribution of sea area and land area
The eastern seas of China include the Bohai Sea, the Yellow Sea, and the East China Sea. First, the latitude and longitude range of each sea area is defined (sea area-latitude/longitude range-degree range), for example, (Bohai Sea, latitude_range_, 37°23′N−41°23′N). Second, the spatial relationship between each sea area and the adjacent land is defined (sea area-spatial relationship-land/sea area) to clearly illustrate the spatial pattern of the sea area and the land.
- (2) Spatial distribution of estuaries
The river estuary is an important node connecting land and sea, significantly impacting the distribution of ocean elements. In this study, we selected the major river estuaries flowing into the aforementioned sea area, including the mouths of the Yellow River, Liao River, Yalu River, Huai River, Yi River, Yangtze River, Qiantang River, Min River, and Pearl River. For each estuary, we describe its geographic location (estuary-latitude/longitude-degrees), for example, (Yellow River Estuary, latitude, 37°24′N), along with its administrative location (estuary-located in-province/city) and its inflow to the sea (estuary-inflow-sea area).
- (3) Spatial distribution pattern of ocean elements
For the six major ocean elements observed by remote sensing, we describe their spatial distribution in different sea areas (sea area–ocean elements–characteristics), coastal characteristics (ocean elements–characteristics–coast), offshore gradient changes (ocean elements–change characteristics–offshore), estuarine distribution patterns (estuaries–ocean elements–characteristics), and the impact of estuaries on their distribution (estuaries–effects–ocean elements).
- (4) Influence mechanism between major ocean elements
There are complex physical, biological, and chemical interactions among ocean elements. To systematically summarize these mechanisms and laws, we extensively collected research data, with Chl-a as the core research element, and sorted out its interactions with other elements (Chl-a–relationship–other elements), as well as the relationships among other elements (other ocean elements–relationship–other elements).
3.2.2. Construction of Knowledge Graph in Temporal Dimension
The long time-series remote sensing images of ocean elements exhibit significant periodic change characteristics. The temporal dimension of OKG is designed to describe this pattern of change over time and contains a total of 53 triplets (Figure 3b). We labeled each input image with a time indicator, classified by season: spring (March–May), summer (June–August), fall (September–November), and winter (December–February), and described the elements that are sensitive to temporal variations. The specific classifications are as follows:
- (1) Seasonal change patterns of ocean elements
We analyzed the changing patterns of values and characteristics of each ocean element across the four seasons (ocean elements–seasons–characteristics) to understand their dynamic change throughout the year.
- (2) Seasonal change rules of ocean currents
Ocean currents are the main driving force for the transportation and mixing of ocean elements, and their temporal changes are crucial for understanding the cyclic pattern of ocean elements. In this study, we select three ocean currents that play an important role in the eastern seas of China: the Kuroshio Current, the Littoral Current, and the Seasonal Circulation. We describe their different characteristics (current–seasonal–characteristics) and changes in their area of influence (current–seasonal–area of influence) over time.
- (3) Mechanisms of ocean currents affecting ocean elements
Seasonal changes in ocean currents drive variations in ocean temperature, salinity, and nutrients, which in turn lead to cyclical patterns of ocean elements. This study details this influence mechanism (ocean currents–influence mechanism–ocean elements) and analyzes the temporal correlation between ocean currents and ocean elements in depth.
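The seasonal time indicator attached to each input image follows directly from the month ranges given at the start of this subsection. A minimal sketch is shown below; the `YYYY.MM` date strings and the function names are assumptions chosen for illustration.

```python
# Month-to-season mapping as defined in the text.
SEASON_BY_MONTH = {
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
    12: "winter", 1: "winter", 2: "winter",
}

def season_of(month):
    """Return the season label for a calendar month (1-12)."""
    if month not in SEASON_BY_MONTH:
        raise ValueError(f"invalid month: {month}")
    return SEASON_BY_MONTH[month]

def label_image_dates(dates):
    """Attach a season label to each 'YYYY.MM' date string (assumed format)."""
    return [(d, season_of(int(d.split(".")[1]))) for d in dates]

labels = label_image_dates(["2002.08", "2018.12", "2024.05"])
print(labels)
```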
3.3. Semantic Embedding Representation of Knowledge Graphs
To effectively integrate domain knowledge from knowledge graphs into deep learning-based prediction models, this study adopts a knowledge graph embedding technique. Knowledge graph embedding projects the symbolic representation of a knowledge graph onto a low-dimensional vector space, yielding a numerical representation of its semantic information. Entities and relationships with similar semantics are thereby placed closer together in the vector space, providing a foundation for downstream knowledge-guided machine learning tasks.
Inspired by the phenomenon that word vectors are translation invariant in semantic space, Bordes et al. (2013) [49] proposed the classical representation learning model TransE. For a triple (h, r, t), the model assumes that the vector representation of the head entity h plus the vector representation of the relation r should be approximately equal to the vector representation of the tail entity t:
$$\mathbf{h} + \mathbf{r} \approx \mathbf{t} \quad (1)$$
By minimizing the distance error of the triples in the embedding space, TransE learns the vector representations of entities and relations, and thus effectively predicts the missing links in the knowledge graph. The advantages of TransE lie in its computational efficiency, simplicity of implementation, and its excellent performance in dealing with simple relations. However, TransE’s uniform treatment of embeddings for entities and relations leads to suboptimal performance when dealing with complex relations such as one-to-many, many-to-one, and many-to-many relations.
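The translation assumption can be made concrete with a small scoring sketch. The distance ||h + r − t|| and the margin-based ranking loss below follow the standard TransE formulation; the vectors, the margin value, and the function names are illustrative assumptions, not the settings used in this study.

```python
import numpy as np

def transe_score(h, r, t, norm=1):
    """Distance ||h + r - t||; lower means the triple is more plausible."""
    return np.linalg.norm(h + r - t, ord=norm, axis=-1)

def margin_loss(pos_score, neg_score, margin=1.0):
    """Ranking loss that pushes corrupted triples at least `margin` further away."""
    return float(np.maximum(0.0, margin + pos_score - neg_score).mean())

h = np.array([1.0, 2.0])
r = np.array([0.5, -1.0])
t = np.array([1.5, 1.0])          # satisfies h + r = t exactly
t_far = np.array([3.0, 3.0])      # corrupted tail, far from h + r
t_near = np.array([1.5, 1.5])     # corrupted tail, close to h + r

pos = transe_score(h, r, t)
loss_far = margin_loss(pos, transe_score(h, r, t_far))
loss_near = margin_loss(pos, transe_score(h, r, t_near))
print(pos, loss_far, loss_near)
```

A corrupted triple that already lies beyond the margin contributes no loss, while one that sits close to the true tail is penalized.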
To address this limitation, an enhanced representation learning model, TransH (Wang et al. 2014) [50], has been proposed. This model introduces the concept of a hyperplane: each relation is associated with a hyperplane on which the translation operation is performed. Specifically, for a triple (h, r, t), TransH first projects the head entity h and the tail entity t onto the hyperplane corresponding to the relation r. Then, it performs a translation operation between the two projection vectors:
$$\mathbf{h}_{\perp} + \mathbf{r} \approx \mathbf{t}_{\perp} \quad (2)$$
where $\mathbf{h}_{\perp}$ and $\mathbf{t}_{\perp}$ are the projection vectors of the head entity h and the tail entity t on the hyperplane of relation r. In this way, TransH is able to handle complex relationships more effectively; however, this improvement also increases the number of parameters and the computational cost of the model, leading to relatively slow training and inference.
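The hyperplane projection can be sketched in a few lines. The example below assumes the usual TransH projection e − (wᵀe)w with a unit normal vector w; all vectors are made-up values chosen so that two different tail entities share the same projection, which is how the model accommodates one-to-many relations.

```python
import numpy as np

def project_onto_hyperplane(e, w):
    """Project entity vector e onto the hyperplane with normal vector w."""
    w = w / np.linalg.norm(w)          # enforce a unit normal
    return e - np.dot(w, e) * w

def transh_score(h, r, t, w):
    """Distance between the translated head and the tail, both projected."""
    diff = project_onto_hyperplane(h, w) + r - project_onto_hyperplane(t, w)
    return float(np.linalg.norm(diff))

w = np.array([0.0, 0.0, 1.0])          # hyperplane normal for the relation
r = np.array([1.0, 0.0, 0.0])          # translation lying in the hyperplane
h = np.array([1.0, 2.0, 3.0])
t1 = np.array([2.0, 2.0, 5.0])
t2 = np.array([2.0, 2.0, -7.0])        # differs from t1 only along the normal

print(transh_score(h, r, t1, w), transh_score(h, r, t2, w))
```

Both tails score as equally plausible because they differ only along the normal direction, which the projection discards.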
The inputs to the semantic embedding representation module are all the triples in the established OKG. The module outputs the semantic feature vectors of the ocean elements in the spatial and temporal dimensions, which cover the spatio-temporal features of the six ocean elements (Chl-a, SST, PIC, POC, PAR, NFLH) in the four seasons. Because the knowledge graph constructed in this paper contains relatively simple relationship types in the temporal and spatial dimensions, the translation assumption of TransE is well suited to it, and its computational efficiency and ease of implementation match the needs of this study. Therefore, TransE is chosen as the embedding model, while TransH is used as the reference model in the evaluation to verify the applicability and advantages of TransE in this task.
To visualize the effect of knowledge graph embedding, we use the t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction method to visualize the distribution of entities and relations in the spatial and temporal dimensions in semantic space, as shown in Figure 4.
3.4. Multimodal Feature Fusion Based on Cross-Attention Mechanism
The cross-attention mechanism is a variant of the attention mechanism. Its core idea is to use the feature vectors of one modality as the query vector (Query) and the feature vectors of another modality as the key vector (Key) and the value vector (Value), and to compute the similarity between the query vector and the key vector, i.e., the attention scores. The value vectors are then weighted and aggregated based on these scores to generate a new feature representation [51]. In recent years, the cross-attention mechanism has shown great potential in the field of feature fusion and has been successfully applied to tasks such as image–sentence matching [52], image fusion [53], and multispectral target detection [54], with notable results.
In this phase, our objective is to fuse the semantic feature vectors represented by the knowledge graph embedding with the image feature vectors extracted by the ConvGRU encoder, so that the image features can be adjusted and optimized with targeted guidance from domain knowledge and ultimately yield a fused feature vector suitable for the ConvGRU decoder. For this purpose, we designed a cross-attention fusion module (CAFM), which consists of a pair of multimodal information fusion modules (MIFMs) and a spatio-temporal information integration module (SIIM), as shown in Figure 5. Specifically, the semantic feature vectors of the spatial and temporal dimensions are separately fused with the image feature vectors in MIFM. Subsequently, the two fused features are further integrated in SIIM to obtain the final fused feature that includes both temporal and spatial characteristics. The process of feature fusion can be formalized as follows:
$$F' = \mathrm{CAFM}(F, S_s, S_t) \quad (3)$$
where $F$ represents the image feature vector before fusion, $S_s$ and $S_t$ represent the spatial and temporal semantic feature vectors, respectively, and $F'$ represents the fused image feature vector, whose size is the same as that of $F$.
Furthermore, before performing MIFM, a learnable nonlinear embedding module (NEM) is designed to reduce modal differences in input features, as shown in Figure 6. NEM projects the semantic feature representations of the knowledge graph onto a space shared with the visual features of the image to achieve their semantic alignment. It consists of two fully connected (FC) layers and a Gaussian error linear unit (GELU) activation function [32]. This architecture improves the performance of the model in handling complex multimodal data by enhancing its nonlinear representation capability.
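As a rough illustration of the NEM structure described above, the sketch below chains two fully connected layers with a GELU between them. The FC → GELU → FC ordering, the layer widths, the random (untrained) weights, and the use of the tanh approximation of GELU are all assumptions; in the actual framework these parameters would be learned jointly with the rest of the network.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the Gaussian error linear unit."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

class NonlinearEmbedding:
    """FC -> GELU -> FC projection of KG embeddings into the image feature space."""

    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Randomly initialised weights stand in for learned parameters.
        self.w1 = rng.normal(0.0, 0.02, size=(in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(hidden_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, x):
        return gelu(x @ self.w1 + self.b1) @ self.w2 + self.b2

# Hypothetical sizes: 5 KG embeddings of width 50 mapped to width 64.
kg_embeddings = np.ones((5, 50))
nem = NonlinearEmbedding(in_dim=50, hidden_dim=128, out_dim=64)
aligned = nem(kg_embeddings)
print(aligned.shape)
```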
3.4.1. Multimodal Information Fusion Module (MIFM)
In this module, we optimize image features based on the multi-head cross-attention mechanism and domain knowledge contained in the semantic vectors, ultimately obtaining adjusted visual–semantic fused features, as shown in Figure 7. The process can be represented as follows:
$$F_s = \mathrm{MIFM}(F, S_s) \quad (4)$$
$$F_t = \mathrm{MIFM}(F, S_t) \quad (5)$$
where $F_s$ and $F_t$ represent the visual–semantic fused feature vectors in the spatial and temporal dimensions, respectively, $F$ is the image feature vector, and $S_s$ and $S_t$ are the spatial and temporal semantic feature vectors. Specifically, semantic features are used as query vectors and image features are used as key vectors and value vectors. The realization steps are as follows:
First, each modal feature is divided into multiple parts, and multiple query, key, and value vectors are generated by linear transformation:
$$Q_h = S_s W_h^{Q}, \quad K_h = F W_h^{K}, \quad V_h = F W_h^{V}, \quad h = 1, \cdots, H \quad (6)$$
where $H$ represents the number of heads; $W_h^{Q}$, $W_h^{K}$, and $W_h^{V}$ are the weight matrices of the query, key, and value of the h-th head, respectively; and $Q_h$, $K_h$, and $V_h$ are the vector matrices of the query, key, and value of the h-th head, respectively.
Subsequently, the dot product similarity between the query vector and the key vector is computed separately for each head. Then, the softmax operation is applied, and a weighted summation is performed with the corresponding value vector to derive the attention output of each head, as expressed by the following formula:
$$\mathrm{head}_h = \mathrm{softmax}\left(Q_h K_h^{\top}\right) V_h \quad (7)$$
Finally, the outputs of all the heads are concatenated. Then, the feature dimensions of the outputs are made consistent with the inputs through a linear transformation, as follows:
$$F_s = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_H)\, W^{O} \quad (8)$$
where $W^{O}$ is a linear transformation weight matrix that unifies the feature dimensions. Equation (5) follows the same steps as Equation (4), except that the spatial semantic feature vectors are replaced by temporal semantic feature vectors. The former are fixed inputs during the model training phase, whereas the latter are selected according to the time point of each image so that the corresponding features are fused.
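The three steps above (per-head projection, dot-product attention with softmax, concatenation followed by an output projection) can be sketched with NumPy. This is a minimal illustration under several assumptions: the weights are random rather than learned, all sizes are hypothetical, and the 1/sqrt(d_head) scaling is the common scaled dot-product convention, whereas the text only mentions dot-product similarity. Note that in this sketch the output has one row per query vector.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(query_feats, context_feats, weights, num_heads):
    """query_feats: (Lq, d); context_feats: (Lk, d); returns (Lq, d)."""
    Wq, Wk, Wv, Wo = weights
    d_model = Wq.shape[1]
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    Q = query_feats @ Wq      # queries come from one modality
    K = context_feats @ Wk    # keys and values come from the other modality
    V = context_feats @ Wv
    heads = []
    for i in range(num_heads):
        cols = slice(i * d_head, (i + 1) * d_head)   # column block of head i
        scores = Q[:, cols] @ K[:, cols].T / np.sqrt(d_head)  # assumed scaling
        heads.append(softmax(scores, axis=-1) @ V[:, cols])
    return np.concatenate(heads, axis=-1) @ Wo       # concat + output projection

rng = np.random.default_rng(0)
d_model, num_heads = 16, 4
semantic = rng.normal(size=(3, d_model))    # hypothetical semantic vectors (queries)
image = rng.normal(size=(10, d_model))      # hypothetical image features (keys/values)
weights = tuple(rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))
fused = multi_head_cross_attention(semantic, image, weights, num_heads)
print(fused.shape)
```

Slicing the projected matrices into column blocks is equivalent to giving each head its own projection matrices.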
3.4.2. Spatio-Temporal Information Integration Module (SIIM)
The purpose of this module is to integrate the multimodal fusion features obtained from the two preceding MIFMs, so as to output integrated feature information in both temporal and spatial dimensions. As shown in Figure 8, the process can be represented as follows:
$$F' = \mathrm{SIIM}(F_s, F_t) \quad (9)$$
To associate important feature information in the temporal dimension with that in the spatial dimension, we use the temporally fused features $F_t$ as query vectors and the spatially fused features $F_s$ as key vectors and value vectors and input them into SIIM for cross-attention-based integration.
Firstly, similar to the previous steps, the query vector from $F_t$ and the key and value vectors from $F_s$ are transformed linearly into $Q$, $K$, and $V$, respectively. Then, a dot product attention layer is employed to calculate the similarity matrix between $Q$ and $K$. This is followed by a softmax operation and weighted summation with $V$ to obtain the interaction information $A$ between $F_t$ and $F_s$. The process can be represented as follows:
$$A = \mathrm{softmax}\left(Q K^{\top}\right) V \quad (10)$$
Finally, in order to enable the output features to serve as inputs for the subsequent ConvGRU decoder, a linear transformation is applied to the integrated result as follows:
$$F' = A W \quad (11)$$
where $W$ is a linear transformation weight matrix, which ensures that the fused feature $F'$ and the image feature vector $F$ before fusion have the same dimension.
3.5. Enhanced ConvGRU Network
ConvGRU, an advanced variant of the Gated Recurrent Unit (GRU), is specifically designed to handle spatio-temporal data by integrating convolutional operations into its gating mechanisms, thereby replacing conventional matrix multiplications. This architecture comprises three fundamental components: the reset gate, update gate, and candidate activation. The reset gate governs the extent to which historical information is discarded, the update gate modulates the assimilation of new information, and the candidate activation produces a provisional state based on the current input and the reset gate’s output. Collectively, these components enable the network to effectively capture temporal dependencies while preserving the spatial integrity of the data, making ConvGRU particularly adept at modeling time-series image data of ocean elements, which inherently exhibit spatio-temporal dependencies [55]. In comparison to ConvLSTM, ConvGRU offers a more streamlined architecture by eliminating the output gate, resulting in a reduction in the number of parameters, accelerated training speeds, and diminished sample size requirements. These attributes render ConvGRU a more computationally efficient and resource-effective solution for the dataset employed in this study.
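For reference, the three components described above are commonly written as follows. This is the standard ConvGRU formulation rather than a transcription of this paper's implementation; the symbol names are assumptions, $*$ denotes convolution, $\odot$ the element-wise product, and $\sigma$ the sigmoid function. Some variants replace $\tanh$ with another activation or swap the roles of $z_t$ and $1 - z_t$.

```latex
\begin{aligned}
z_t &= \sigma\left(W_{xz} * x_t + W_{hz} * h_{t-1} + b_z\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_{xr} * x_t + W_{hr} * h_{t-1} + b_r\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_{xh} * x_t + W_{hh} * (r_t \odot h_{t-1}) + b_h\right) && \text{(candidate activation)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
```

Because every product with a weight tensor is a convolution, the hidden state $h_t$ keeps the spatial layout of the input feature maps.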
However, ordinary ConvGRUs have limitations in handling hierarchical and multi-scale spatio-temporal features, thus failing to effectively encode and decode the spatio-temporal information in fused features. To address these limitations, we design an enhanced ConvGRU network as shown in Figure 9. In the encoder component of this enhanced network, a hierarchical three-layer architecture is employed. Each layer integrates a down-sampling convolutional layer and a ConvGRU cell. The down-sampling convolutional layer systematically reduces the spatial resolution of the input images through 2D convolutional operations, while simultaneously capturing localized spatial features. The ConvGRU cell utilizes a gating mechanism to model temporal dependencies and extract multi-scale spatio-temporal features, thus encoding the image sequences into high-dimensional feature representations. Specifically, the convolutional operations within ConvGRU preserve spatial features by capturing local patterns, while the gating mechanism (update and reset gates) dynamically controls the flow of information to retain temporal dependencies. This ensures that both spatial and temporal features are effectively integrated and represented in the encoded feature space.
In the decoder component, a symmetrical three-layer structure is adopted, with each layer consisting of an up-sampling layer and a ConvGRU cell. The up-sampling layer progressively restores the spatial resolution of the feature maps through interpolation or transposed convolution, ensuring the preservation of fine-grained details. The ConvGRU cell further refines the integration of spatio-temporal features to maintain temporal coherence in the predictive outputs. The final layer incorporates an inverse convolutional operation to enhance the precision of image details, resulting in high-resolution predictions. This encoder–decoder framework facilitates the effective extraction and reconstruction of spatio-temporal features of ocean elements, providing robust and discriminative feature representations for subsequent predictive modeling tasks.
3.6. Spatio-Temporal Multi-Step Prediction of Ocean Elements Based on OKG-ConvGRU
Traditional multi-step rolling prediction iteratively updates the dataset by adding the single-step prediction results to the end of the input sequence to achieve multi-step prediction [56,57]. This method is simple and easy to implement, with low computational complexity; however, it suffers from the cumulative error problem, where the prediction error gradually accumulates during the iteration process, resulting in decreased accuracy as the prediction steps increase. Additionally, this method relies solely on local information and may overlook global time dependencies.
To address these issues, this study introduces the Seq2Seq (Sequence-to-Sequence) architecture [58,59] and combines it with the concept of multi-step rolling prediction, proposing a multi-step prediction method that leverages the advantages of both approaches. The Seq2Seq architecture consists of two components: the encoder and the decoder. The encoder encodes the input sequences into a fixed-length vector, while the decoder generates the output sequences based on this vector. In the model design, both the encoder and decoder adopt a four-layer OKG-ConvGRU stacking structure to effectively extract spatio-temporal features, as shown in Figure 10.
During the training phase, the model learns the mapping from the input sequence to the output sequence. When the number of prediction steps exceeds the decoder’s step size, the multi-step rolling prediction approach is adopted: the first time step output by the decoder is fed back to the encoder as part of a new input sequence for iterative prediction. This approach retains the Seq2Seq architecture’s ability to model global dependencies while enabling flexible long-sequence prediction through multi-step rolling and reducing computational complexity. It thereby mitigates the cumulative error and local information dependence issues of traditional methods, enhancing the flexibility and adaptability of the model while maintaining prediction accuracy.
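The rolling scheme can be sketched as a short loop. The version below is one possible reading of the description, stated as an assumption: while more steps remain than a single decoder pass can produce, only the first predicted frame is kept and appended to the sliding input window; the last pass supplies all remaining frames. `toy_model` is a stand-in that operates on scalars instead of images and is purely hypothetical.

```python
def rolling_forecast(model, history, horizon, k):
    """Forecast `horizon` steps with a model that emits k steps per pass."""
    window = list(history)
    preds = []
    # Roll forward one step at a time until the rest fits in one decoder pass.
    while horizon - len(preds) > k:
        out = model(window)                 # k predicted frames
        preds.append(out[0])                # keep only the first step
        window = window[1:] + [out[0]]      # slide the input window
    out = model(window)
    preds.extend(out[: horizon - len(preds)])
    return preds

def toy_model(window, k=3):
    """Hypothetical stand-in for the Seq2Seq network: continues a counter."""
    return [window[-1] + i + 1 for i in range(k)]

long_run = rolling_forecast(toy_model, [1, 2, 3, 4], horizon=5, k=3)
short_run = rolling_forecast(toy_model, [1, 2, 3, 4], horizon=2, k=3)
print(long_run, short_run)
```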
4. Experimental Results
4.1. Dataset Processing
4.1.1. Data Preprocessing
In this part, we performed several preprocessing operations on the original satellite images to improve the data quality and better suit them to the subsequent spatio-temporal prediction. To address missing values in the original images, the data interpolating empirical orthogonal functions (DINEOF) method [
60,
61] was utilized to reconstruct the missing image data. This method restores the missing values while retaining the spatio-temporal variation characteristics of the data through spatio-temporal covariance matrix decomposition and iterative interpolation. Subsequently, high-precision land vector data corresponding to the selected projection were used to mask land areas in the ocean color data, thereby eliminating geographic interference. We then handled outliers by replacing negative values with 0 and applying Winsorization, which replaces pixel values above the upper limit with the upper limit. To unify the scales of the multi-source data, the parameters were normalized to the [0, 1] interval by min–max normalization [
62]. Finally, the images were uniformly cropped to 320 × 568 pixels to fit the model inputs.
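The outlier handling and normalization steps can be illustrated with a short sketch. It assumes a single upper limit chosen by the user (the paper does not state its value) and operates on a flat list of pixel values; `clip_and_normalize` and the sample values are hypothetical.

```python
from typing import List


def clip_and_normalize(values: List[float], upper_limit: float) -> List[float]:
    """Clip values to [0, upper_limit], then rescale to [0, 1] (min-max)."""
    # Negative values become 0; values above the limit become the limit.
    clipped = [min(max(v, 0.0), upper_limit) for v in values]
    lo, hi = min(clipped), max(clipped)
    if hi == lo:
        # Constant field: avoid division by zero.
        return [0.0 for _ in clipped]
    return [(v - lo) / (hi - lo) for v in clipped]


# Illustrative pixel values only; the upper limit of 10.0 is an assumption.
pixels = [-0.3, 0.0, 2.0, 5.0, 40.0]
normalized = clip_and_normalize(pixels, upper_limit=10.0)
```

In practice the minimum and maximum would typically be computed per variable over the training period, so that the same scaling can be applied to validation and test data.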
The dataset division strictly followed the principle of temporal continuity. The 262 months of data from August 2002 to May 2024 (2002.08–2024.05) were divided into three subsets: the training set (2002.08–2018.05, 190 months), used for model parameter learning; the validation set (2018.06–2021.05, 36 months), used for hyperparameter optimization; and the test set (2021.06–2024.05, 36 months), used to evaluate the model’s generalization ability.
4.1.2. Correlation Analysis
To verify that the input variables of our model are reasonable, a nonparametric statistical method was used for multivariate correlation analysis. By calculating the Spearman correlation coefficient, the monotonic correlations between Chl-a and the other five marine environmental variables (i.e., SST, PIC, POC, PAR, NFLH) were quantified.
As demonstrated in
Figure 11, the heatmap constructed from Spearman’s rank correlation coefficient (R) reveals the pattern of correlation between Chl-a and the other ocean elements. The intensity of the color scale is positively correlated with |R|. The results show that there were significant correlations between Chl-a and all five environmental variables. Specifically, the strongest and most significant negative correlation (R = −0.792,
p < 0.001) is observed between Chl-a and SST, suggesting that elevated water temperatures may inhibit algal metabolism. Conversely, positive correlations are identified between Chl-a and the other variables (PAR, POC, PIC, and NFLH), which reflect the positive effects of light and organic matter on phytoplankton growth. In addition, significant cross-correlations are identified among the variables. For instance, a strong positive correlation is observed between PAR and SST (R = 0.863,
p < 0.001), suggesting a synergistic effect between solar radiation and surface seawater thermodynamic processes. This multi-dimensional correlation network supports the ecological coupling of the input variables.
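For readers unfamiliar with the statistic, Spearman’s coefficient is the Pearson correlation of the ranks of the two variables. The sketch below is a pure-Python illustration under that definition, with ties given their average rank; the function names and the five sample values are hypothetical and are not taken from the dataset.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values: Chl-a falls monotonically as SST rises.
chl_a = [2.1, 1.8, 1.2, 0.9, 0.7]
sst = [8.0, 11.5, 17.0, 22.0, 26.5]
rho = spearman(chl_a, sst)
```

Because the statistic depends only on ranks, it captures any monotonic relationship, not just a linear one, which is why it suits variables with skewed distributions such as Chl-a.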
4.1.3. Construction of Time-Series Slicing Sample Set
This study used a sequence of ocean element images from a continuous period of time in the past as input to predict their values in the future. To this end, we adopted a sliding window method along the timeline to slice the preprocessed time-series images and construct a sample dataset. As shown in
Figure 12, each sample consists of T time-series images, where the first T/2 images are used as inputs and the observed values of the following T/2 images are used as the corresponding labels (i.e., prediction targets).
Through extensive experiments and comparative analysis, we found that the prediction model achieves excellent performance when the time-series slice length T is set to 10 months, with input and output sequence lengths of 5 months each. When the prediction horizon k exceeds 5 months, a multi-step rolling prediction strategy is adopted to extend the forecast. Specifically, sequence image data of six ocean elements from the past five months are used to predict Chl-a images for multiple future time steps, including horizons of k > 5 months.
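The sliding-window slicing can be sketched as follows. This is an illustrative example only: `make_samples` is a hypothetical helper, the frames are represented by month indices instead of images, and a stride of one month is an assumption.

```python
from typing import List, Sequence, Tuple


def make_samples(
    frames: Sequence[int], window_length: int = 10, stride: int = 1
) -> List[Tuple[List[int], List[int]]]:
    """Slice a time series into (input, label) pairs with a sliding window.

    Each window holds `window_length` consecutive frames; the first half
    is the model input and the second half is the label sequence.
    """
    half = window_length // 2
    samples = []
    for start in range(0, len(frames) - window_length + 1, stride):
        window = list(frames[start:start + window_length])
        samples.append((window[:half], window[half:]))
    return samples


# 24 consecutive months stand in for the preprocessed image sequence.
months = list(range(1, 25))
samples = make_samples(months)
first_inputs, first_labels = samples[0]
```

A series of n frames yields n − T + 1 samples at stride 1, so consecutive samples overlap heavily; this is one reason the split between training, validation, and test data has to respect temporal order.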
4.2. Evaluation Metrics
In this study, three evaluation metrics were employed to quantitatively assess the overall prediction performance of the model: mean absolute error (MAE), root mean square error (RMSE), and goodness of fit ($R^2$). A lower MAE and RMSE indicate higher model prediction accuracy. $R^2$ is used to evaluate the degree of fit between the predicted values and observed values of the model, with values ranging from 0 to 1. When $R^2$ is close to 1, the goodness of fit is high, indicating that the observed values are close to the expected values of the model, i.e., the difference between the model’s predictions and the actual observations is small. Conversely, when $R^2$ is close to 0, the model’s predictions differ significantly from actual observations. The units of MAE and RMSE are the same as those of the predicted target element (Chl-a), mg/m$^3$. These indicators are defined as follows:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}$$
where $N$ is the total number of samples, $y_i$ is the actual observed value, $\hat{y}_i$ is the model’s predicted value, and $\bar{y}$ is the average of the true values of all samples.
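The three metrics can be computed directly from paired observed and predicted values. The sketch below is a minimal pure-Python illustration, assuming the common definition of the goodness of fit as one minus the ratio of the residual sum of squares to the total sum of squares; the function names and sample values are hypothetical and are not taken from the experiments.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Goodness of fit: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Illustrative values only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
scores = (mae(observed, predicted), rmse(observed, predicted), r_squared(observed, predicted))
```

For image data, the same functions would be applied over all valid (non-masked) ocean pixels of the predicted and observed frames.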
To evaluate the performance of the embedding model, we used three metrics: Mean Rank (MR), Mean Reciprocal Rank (MRR), and Hits at K (Hits@K). MR indicates the average position of the correct entity in the ranking of all predictions, with lower values indicating more accurate model predictions; MRR measures the average of the inverse of the rankings of the correct entities, which is more focused on whether the correct answer appears in the top ranks, and higher values are better; Hits@K reflects the proportion of correct entities included in the top K predictions and is used to assess the accuracy of the model, especially when K is small, and higher values indicate better model performance. Note that for a more accurate evaluation, we removed the real triples in training and calculated the filtered rankings to obtain the filtered metrics above. The metrics are defined as follows:
$$\mathrm{MR} = \frac{1}{|T|}\sum_{t \in T}\mathrm{rank}(t)$$
$$\mathrm{MRR} = \frac{1}{|T|}\sum_{t \in T}\frac{1}{\mathrm{rank}(t)}$$
$$\mathrm{Hits@}K = \frac{1}{|T|}\sum_{t \in T}\mathbb{I}\left(\mathrm{rank}(t) \le K\right)$$
where $T$ denotes the test set, $\mathrm{rank}(t)$ denotes the rank position of the correct entity (or relation) $t$ in a particular query, $\mathbb{I}(\cdot)$ is the indicator function, and $K$ represents that only the top $K$ positions are considered when evaluating the prediction results.
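The ranking metrics and the filtered setting can be sketched as follows. This is an illustration under the usual conventions (lower score means a better candidate, and other known true candidates are excluded before ranking); `filtered_rank`, the candidate names, and the scores are hypothetical.

```python
from typing import Dict, Sequence, Set


def filtered_rank(scores: Dict[str, float], target: str, known_true: Set[str]) -> int:
    """Rank of `target` among candidates, ignoring other known true answers.

    `scores` maps each candidate entity to a distance (lower is better).
    """
    target_score = scores[target]
    better = sum(
        1
        for candidate, score in scores.items()
        if candidate != target and candidate not in known_true and score < target_score
    )
    return better + 1


def mean_rank(ranks: Sequence[int]) -> float:
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    return sum(1 for r in ranks if r <= k) / len(ranks)


# One query: "a" is another true answer seen in training, so it is filtered out.
query_scores = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
rank_c = filtered_rank(query_scores, target="c", known_true={"a"})

# Ranks of the correct entity over four hypothetical test queries.
ranks = [1, 2, 4, 10]
summary = (mean_rank(ranks), mean_reciprocal_rank(ranks), hits_at_k(ranks, 3))
```

Without filtering, candidate "a" would push the target to rank 3 even though "a" is itself a correct answer, which is the bias the filtered metrics are meant to remove.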
4.3. Model Implementation Details
4.3.1. Experimental Environment
The experiments were conducted on a workstation equipped with an Intel Core i7-14650HX processor running Windows 11. The model was implemented in the PyTorch framework (version 1.12), and training was accelerated with an NVIDIA RTX 4070 graphics card (16 GB video memory) and CUDA version 12.5. Code development and debugging were conducted in the PyCharm (version 2024.1.1) integrated development environment.
4.3.2. Model Settings
The model used the mean square error (MSELoss) as the loss function and the Adam optimizer, with an initial learning rate of 0.001, a sampling variation rate of 0.00002, and a total of 25,000 training rounds. To ensure reproducibility and eliminate potential bias, the random seed was fixed throughout the training process. To address the computational cost posed by the increase in model complexity, this study employed a block-based training strategy. The input data were segmented into n sub-blocks (n is the number of blocks) along the channel dimension. These sub-blocks were processed separately during training, and the outputs were reassembled to the original image size at the prediction stage.
4.4. Knowledge Graph Embedding Evaluation
For the constructed OKG (which includes spatial and temporal dimensions), we used the TransE and TransH models for embedding. In training knowledge graph embedding models, the choice of hyperparameters significantly affects model performance. To identify the optimal combination of hyperparameters, we employed a Bayesian optimization strategy to tune the parameter configurations of both the TransE and TransH models. The objective function was defined as the Mean Rank (MR), which was minimized. The hyperparameters to be optimized included the embedding dimension of entities and relations (ranging from 50 to 200, with a step size of 10), the margin parameter in the loss function (ranging from 0.5 to 2.0, with a step size of 0.1), the weight of the soft constraint (ranging from 0.01 to 0.5, with a step size of 0.01), the learning rate (ranging from 0.001 to 0.1, with a step size of 0.001), and the number of negative samples per positive sample (ranging from 1 to 50, with a step size of 1). In the Bayesian optimization process, we first constructed a probabilistic model of the objective function using a Gaussian Process (GP), with Expected Improvement (EI) as the acquisition function to guide the search in the parameter space. In each iteration, the objective function was evaluated and the Gaussian Process model was updated accordingly, gradually approaching the global optimum until the objective function converged. The results show that TransE significantly outperformed TransH across all evaluation metrics under the optimal configuration (see
Table 2). This result may be attributed to the fact that the types of relationships in the constructed knowledge graphs are relatively simple, and the TransE’s translation assumption (h + r ≈ t) is more suitable for handling such simple relationships. In contrast, TransH introduces relation-specific hyperplanes to model complex relations, but in this scenario, this complexity may be redundant and instead increase the model complexity and training difficulty.
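The difference between the two scoring functions can be made concrete with a small sketch. It follows the standard formulations (TransE scores a triple by the distance between h + r and t; TransH first projects h and t onto a relation-specific hyperplane) and a margin-based ranking loss; the function names and toy vectors are hypothetical and unrelated to the OKG embeddings.

```python
import math
from typing import List, Sequence


def l2_norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def transe_score(h: Sequence[float], r: Sequence[float], t: Sequence[float]) -> float:
    """TransE distance ||h + r - t||; small when h + r is close to t."""
    return l2_norm([hi + ri - ti for hi, ri, ti in zip(h, r, t)])


def project(v: Sequence[float], unit_normal: Sequence[float]) -> List[float]:
    """Project v onto the hyperplane whose unit normal is `unit_normal`."""
    dot = sum(vi * wi for vi, wi in zip(v, unit_normal))
    return [vi - dot * wi for vi, wi in zip(v, unit_normal)]


def transh_score(
    h: Sequence[float], d_r: Sequence[float], w_r: Sequence[float], t: Sequence[float]
) -> float:
    """TransH distance ||h_perp + d_r - t_perp|| on the relation hyperplane."""
    norm = l2_norm(w_r)
    unit = [x / norm for x in w_r]  # normalise the hyperplane normal
    h_p, t_p = project(h, unit), project(t, unit)
    return l2_norm([a + b - c for a, b, c in zip(h_p, d_r, t_p)])


def margin_loss(pos_score: float, neg_score: float, margin: float = 1.0) -> float:
    """Margin ranking loss for one positive and one corrupted triple."""
    return max(0.0, margin + pos_score - neg_score)


# Toy 2-D triple that satisfies h + r = t exactly, and a corrupted tail.
pos = transe_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
neg = transe_score([1.0, 0.0], [0.0, 1.0], [3.0, 1.0])
loss = margin_loss(pos, neg, margin=1.0)

# Toy 3-D TransH triple: h and t differ only along the hyperplane normal.
th = transh_score([1.0, 0.0, 5.0], [0.0, 1.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, -3.0])
```

TransH carries an extra normal vector per relation, so it has more parameters to fit; when relations are simple, that additional flexibility may bring little benefit, which is consistent with the explanation given above.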
During the Bayesian optimization process, we observed that when both models achieved their optimal configurations (with the lowest MR), TransE required significantly fewer negative samples and training epochs, as well as smaller batch sizes and embedding dimensions, than TransH. This difference stems from TransE’s simpler structure, which has lower dependence on training resources and data while maintaining high performance even under resource constraints. In contrast, TransH’s more complex architecture demands greater resources to prevent overfitting.
To further analyze the embedding model’s performance, we visualized the relationships among the head entity, relation, and tail entity by randomly sampling triples and projecting them into a two-dimensional space using t-SNE (
Figure 13). This was conducted for both spatial and temporal dimensions.
Comparing
Figure 13a,b with
Figure 13c,d, the following phenomena were observed: (1) In TransE, the entity-to-relation distance is larger, indicating that the model relies more on the relation vectors than on entity similarity to convey information. Additionally, the distance from the head entity plus the relation to the tail entity is smaller, which is consistent with its efficient tail entity prediction. (2) In contrast, in TransH, the projected head and tail entities lie close to their pre-projection positions, indicating that the projection operation has a limited effect on the entity vectors and that the relation hyperplane fails to adequately capture the semantic changes in the entities. The larger distance from the projected head entity plus the relation to the tail entity indicates that the relation vector fails to model the semantic transformation effectively, reflecting its inadequacy in capturing the structure of the triples. (3) The distribution of triples in TransE is more uniform, balancing the representation of entities and relations, whereas the distribution in TransH is more concentrated. The projection operation restricts the diversity of expression, so the semantic information does not unfold adequately.
The above analysis further confirms the significant advantages of TransE in embedding spatial and temporal knowledge graphs. To quantify this conclusion, we substituted each of the two embedding methods into the overall prediction model and calculated their
$R^2$, MAE, and RMSE at the T + 1 time step during the test phase, as shown in
Figure 14.
The experimental results show that the prediction performance of OKG-ConvGRU1 is significantly better than that of OKG-ConvGRU2, indicating that the quality of knowledge graph embedding has an important impact on the overall prediction accuracy of the model. Based on these results, we used TransE to embed spatial and temporal knowledge graphs in subsequent model applications to ensure the efficiency and accuracy of the model in prediction tasks.
4.5. Comparison of Predictive Performance with Benchmark Models
To comprehensively evaluate the performance of the proposed OKG-ConvGRU model in spatio-temporal oceanic element prediction, this study conducted a systematic experimental evaluation of Chl-a prediction. The evaluation was based on the test dataset of the study area spanning June 2021 to May 2024, totaling 36 months. In the experimental design, the OKG-ConvGRU model was first compared with the purely data-driven CA-ConvGRU (ConvGRU combined with a cross-attention mechanism) model. To further validate the model’s superiority, we selected five mainstream deep learning-based spatio-temporal prediction models as benchmarks: the GRU model, the CNN-LSTM hybrid model, the ConvLSTM model, the ConvGRU model, and the SA-ConvLSTM (ConvLSTM with a self-attention module) model. In addition, we used a simple climatological mean prediction model as a baseline. For each month in the test set, this model takes as its prediction the historical average of the chlorophyll-a concentration for the same calendar month in the training and validation sets.
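The historical-average baseline is simple enough to state in code. The sketch below is illustrative only: it averages scalar values per calendar month, whereas the actual baseline would average images pixel by pixel; `monthly_climatology` and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def monthly_climatology(history: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Average the historical values of each calendar month (1-12)."""
    totals: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for month, value in history:
        totals[month] += value
        counts[month] += 1
    return {month: totals[month] / counts[month] for month in totals}


# Two hypothetical years of January and February values.
history = [(1, 2.0), (2, 3.0), (1, 4.0), (2, 5.0)]
climatology = monthly_climatology(history)

# The forecast for any future January or February is the stored mean.
forecast = [climatology[month] for month in (1, 2)]
```

Because the forecast depends only on the calendar month, it reproduces the seasonal cycle but cannot respond to interannual anomalies, which makes it a useful lower bound for the learned models.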
To ensure the scientific validity and reliability of the experimental results, all models were tested under the same experimental environment and dataset, using the same evaluation metrics.
Table 3 shows the best values of each evaluation metric for Chl-a prediction by each model at five prediction time steps from T + 1 to T + 5, providing a reliable basis for quantitative comparison of model performance.
Through the comparative analysis of the values in
Table 3, it can be observed that our proposed OKG-ConvGRU model significantly outperforms other models in future multi-step prediction. Specifically, the MAE and RMSE values of OKG-ConvGRU remain within 0.210 and 0.630, respectively, with minimum values of 0.202 and 0.617, which are the lowest among all models. Meanwhile, its
$R^2$ value remains stable above 0.9971, with a maximum value of 0.9974, further verifying the superiority of OKG-ConvGRU in terms of fitting effect and prediction accuracy.
Compared to the single-modal CA-ConvGRU, OKG-ConvGRU significantly improves all the metrics in multi-step prediction, which demonstrates the effectiveness of incorporating prior knowledge from the knowledge graph into the prediction model. Additionally, among the benchmark models, the prediction accuracies of the CA-ConvGRU, SA-ConvLSTM, ConvGRU, CNN-LSTM, ConvLSTM, and GRU models decrease in that order, and the climatological mean prediction model performs worst. A likely explanation is that models with a greater capacity to capture spatio-temporal features predict more accurately, whereas the climatological mean prediction model relies only on historical averages and cannot capture complex dynamic changes. CA-ConvGRU, which introduces the cross-attention mechanism, achieves higher prediction accuracy than SA-ConvLSTM, which incorporates the self-attention module, indicating a stronger ability to capture spatio-temporal dependencies. Meanwhile, ConvGRU is more streamlined than ConvLSTM, which helps to offset the additional model complexity introduced by the attention mechanism. To assess the statistical significance of differences in predictive performance among the models, this study employed a one-way analysis of variance (one-way ANOVA) on the mean absolute error (MAE) of seven models. As illustrated in
Figure 15, the test results indicate that the differences in MAE between ConvGRU and both ConvLSTM and CNN-LSTM did not pass the significance test (
p > 0.05), likely because of their similar architectural characteristics and feature extraction mechanisms, whereas the differences in MAE among all other models were statistically significant (
p < 0.05). Notably, the differences in MAE between the proposed OKG-ConvGRU model and all baseline models reached an extremely significant level (
p < 0.001), demonstrating that, at the 99.9% confidence level, the predictive performance of OKG-ConvGRU is significantly superior to that of the baseline models.
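As a reminder of what a one-way ANOVA computes, the sketch below derives the F statistic from groups of per-sample errors. It is a generic illustration, not the authors’ analysis: the function name and the three small groups are hypothetical, and the p-value lookup from the F distribution is omitted.

```python
from typing import List, Sequence, Tuple


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means: List[float] = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of samples around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up absolute errors for three hypothetical models.
errors = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(errors)
```

A large F indicates that the spread between model means is large relative to the spread within each model. In practice, a statistics library such as SciPy (`scipy.stats.f_oneway`) returns both the statistic and its p-value, and pairwise comparisons require a post hoc test on top of the omnibus result.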
Further analysis of the performance of each model in multi-step prediction reveals that the prediction effectiveness of the ConvGRU, ConvLSTM, and CNN-LSTM models decreases significantly as the number of prediction steps increases. This error accumulation phenomenon mainly stems from the fact that each prediction step relies on the output of the previous step, resulting in an amplifying error. In contrast, the CA-ConvGRU, SA-ConvLSTM, and OKG-ConvGRU models exhibit different characteristics; their errors do not increase significantly with the number of prediction steps but instead show a fluctuating and relatively stable trend. This phenomenon may be attributed to the introduction of the attention mechanism, which allows the models to learn deeper patterns and long-term dependencies in long time-series. It is worth noting that the OKG-ConvGRU model demonstrates the strongest metric stability at all time steps, indicating that the prior knowledge in the knowledge graph effectively aligns the periodic spatio-temporal variation characteristics of the input images. This alignment helps the model to more accurately capture and maintain the intrinsic patterns and dynamic characteristics present in the sequence data.
To visually assess differences in model prediction performance across regions, we plotted the February 2024 (T + 1 time step) Chl-a concentration prediction results of multiple models in the study area, as shown in
Figure 16. We also plotted the distribution of prediction errors at this time point, as shown in
Figure 17. Comparing the prediction results and error distributions of the different models, we find that OKG-ConvGRU exhibits the best performance in terms of prediction accuracy and detailed feature capture, with its prediction trend highly consistent with the actual values. CA-ConvGRU tends to overestimate the Chl-a concentration along the Bohai Sea coastline, while the predictions from SA-ConvLSTM are relatively accurate in the Bohai Sea region but less reliable in the deep-sea region. This may be due to the fact that these two models rely too heavily on the attention mechanism and fail to fully capture the prevailing spatial and temporal dependencies and distribution patterns.
In addition, the prediction results of ConvGRU, ConvLSTM, CNN-LSTM, and GRU are consistent with the actual values in terms of the overall trend. However, there are significant differences in the details, particularly in localized areas of the deep sea. This phenomenon may be related to the limitations of these models in capturing complex spatio-temporal dependencies, especially their insufficient ability to model the nonlinear patterns of change in deep-sea regions.
4.6. Long-Term Predictive Performance Evaluation
To further evaluate the performance of OKG-ConvGRU in multi-step prediction, we used the image data from five time steps in the test set (August 2022 to December 2022) as model inputs to predict Chl-a for the next ten time steps (January 2023 to October 2023). The long-term predictions of the model are shown in
Figure 18.
By comparing the prediction results with the observations at different time steps, we find that the model demonstrates excellent performance in short-term prediction, with its prediction results highly consistent with the observations. However, as the number of prediction steps increases, a discrepancy between the predicted and observed values gradually emerges, which is especially significant in the deep-sea region far from the coast. Nevertheless, the model still exhibits good overall prediction performance. In the multi-step prediction task, this study innovatively adopted a hybrid prediction strategy that combines the Seq2Seq architecture with multi-step rolling prediction. To verify the effectiveness of this strategy, we designed a series of controlled experiments. With the premise of ensuring the consistency of the input data, we used both the hybrid prediction strategy and the traditional multi-step rolling prediction method to make predictions. The error metrics (MAE and RMSE) were calculated for different prediction step sizes, as shown in
Figure 19.
Figure 19 shows the long-term prediction performance of the OKG-ConvGRU model under the two prediction strategies. With the traditional multi-step rolling prediction approach, the error metrics (MAE and RMSE) increase significantly as the prediction horizon lengthens, and the growth is nearly exponential, which is consistent with error accumulation in iterative prediction. In contrast, when the Seq2Seq architecture is combined with multi-step rolling prediction, the error metrics still show an overall increasing trend, but they fluctuate within a narrow range and even decrease at certain time steps. Specifically, the MAE and RMSE of the model remain within 0.225 and 0.658, respectively, for predictions made from January to October 2023, indicating that this strategy can effectively improve the accuracy and stability of the model in long-term prediction.
4.7. Model Data Efficiency and Robustness Analysis
To explore the impact of the joint knowledge and data-driven approach on data dependency, we compared the prediction performance of the OKG-ConvGRU model with that of the single data-driven CA-ConvGRU model by progressively scaling down the size of the training set. Specifically, the original training set was divided into multiple subsets at different scales (40%, 60%, 80%, and 100%), and both models were trained on each subset and evaluated for their performance on the same test set, as shown in
Figure 20. To ensure the reproducibility of the experiments, the same hyperparameter settings were used for all experiments.
Figure 20 illustrates how the prediction accuracy of the OKG-ConvGRU and CA-ConvGRU models varies with training data volume. The experimental results demonstrate that (1) as the training data are reduced, the MAE of OKG-ConvGRU increases 28.5% more slowly than that of CA-ConvGRU, with this advantage becoming more pronounced as data volume decreases, indicating superior data robustness; (2) when trained on the full dataset (100% training data), OKG-ConvGRU achieves a 33.7% lower MAE than CA-ConvGRU, indicating higher predictive precision; and (3) to attain equivalent prediction performance (MAE ≤ 0.25), OKG-ConvGRU requires 24.3% less training data than CA-ConvGRU, significantly enhancing data utilization efficiency. These findings support the efficacy of integrating knowledge-driven and data-driven strategies in remote sensing spatio-temporal prediction tasks, particularly in scenarios with limited oceanic remote sensing data.
5. Conclusions
In this study, we propose a domain knowledge-guided remote sensing prediction framework for ocean elements, which integrates the constructed spatio-temporal knowledge graph of ocean elements and the ConvGRU network. The framework’s CAFM employs a cross-attention mechanism to deeply integrate visual and semantic features for ocean elements. This fusion enables the model to leverage both domain knowledge and time-series remote sensing imagery, thereby effectively capturing nonlinear spatial and temporal dependencies among ocean elements. Experimental validation using monthly remote sensing image data of ocean elements in the eastern seas of China (Bohai Sea, Yellow Sea, and East China Sea) demonstrates the OKG-ConvGRU model’s significant advantages over existing benchmarks in prediction accuracy, data utilization efficiency, and long-term prediction stability. The primary research conclusions are as follows:
- (1)
Joint Knowledge–Data Paradigm: This work introduces the first joint knowledge-and-data-driven remote sensing prediction method for ocean elements, effectively coupling a knowledge graph with a ConvGRU model. The results show that compared to purely data-driven models, this framework not only captures nonlinear spatio-temporal patterns of ocean elements but also elucidates complex inter-element influence mechanisms. By compensating for information gaps that are difficult to infer directly from data, the model achieves marked improvements in prediction accuracy and long-term stability.
- (2)
Data Efficiency and Robustness: The hybrid knowledge–data approach reduces the model’s reliance on large datasets while enhancing its efficiency and robustness. By incorporating knowledge-based constraints and guidance, the model learns accurate patterns from smaller data volumes, mitigating dependency on extensive datasets.
- (3)
Knowledge Graph Embedding: The semantic representation of the knowledge graph critically influences prediction performance. Specifically, the TransE model demonstrates superior embedding effectiveness compared to TransH for ocean element knowledge graphs, yielding significant overall performance enhancements.
- (4)
Multi-Step Prediction Strategy: In multi-step forecasting, combining Seq2Seq architecture with multi-step rolling prediction effectively suppresses error accumulation across extended prediction horizons compared to conventional rolling methods.
Although the OKG-ConvGRU framework demonstrates significant performance advantages in marine element prediction, its knowledge embedding mechanism has, to some extent, increased the model complexity and computational overhead. Meanwhile, limited by the size of existing knowledge graphs, the framework currently focuses on spatio-temporal prediction research of typical marine elements (such as chlorophyll-a concentration and sea surface temperature) in the eastern China seas, including the Bohai Sea, East China Sea, and Yellow Sea. Based on the current research status, subsequent work will emphasize the following two aspects: first, optimizing the efficiency of knowledge embedding to reduce the computational load of the model; second, expanding the coverage of the knowledge graph to enhance the framework’s practicality and regional adaptability, thereby providing technical support for marine element prediction in more extensive sea areas.