Integrating Ecological Semantic Encoding and Distribution-Aligned Loss for Multimodal Forest Ecosystem

Peng, Jing; Fu, Zhengjie; Zhou, Huachen; Liu, Yibin; Zhang, Yang; Shi, Rui; Li, Jiangfeng; Dong, Min

doi:10.3390/f16111697

by Jing Peng ^1,2, Zhengjie Fu ^3,4, Huachen Zhou ^3,4, Yibin Liu ^3, Yang Zhang ^4, Rui Shi ^3, Jiangfeng Li ^2 and Min Dong ^3,*

^1 Hubei Key Laboratory of Biological Resources Protection and Utilization, Hubei Minzu University, Enshi 445000, China
^2 Department of Land Resource Management, School of Public Administration, China University of Geosciences, Wuhan 430074, China
^3 China Agricultural University, Beijing 100083, China
^4 National School of Development, Peking University, Beijing 100871, China
^* Author to whom correspondence should be addressed.

Forests 2025, 16(11), 1697; https://doi.org/10.3390/f16111697

Submission received: 11 October 2025 / Revised: 28 October 2025 / Accepted: 6 November 2025 / Published: 7 November 2025

(This article belongs to the Special Issue Artificial Intelligence and Machine Learning Applications in Forestry—Second Edition)

Abstract

In this study, a cross-hierarchical intelligent modeling framework integrating an ecological semantic encoder, a distribution-aligned contrastive loss, and a disturbance-aware attention mechanism was developed to address the semantic alignment challenge between aboveground vegetation and belowground seed banks within forest ecosystems. The proposed framework leverages artificial intelligence and deep learning to characterize the structural and functional coupling between vegetation and soil communities, thereby elucidating the ecological mechanisms that underlie forest regeneration and stability. Experiments using representative forest ecological plot datasets demonstrated that the model achieved a top-1 accuracy of 78.6%, a top-5 accuracy of 89.3%, a mean cosine similarity of 0.784, and a reduced Kullback–Leibler divergence of 0.128, while the Jaccard index increased to 0.512—surpassing traditional statistical and machine-learning baselines such as RDA, CCA, Procrustes, Siamese, and SimCLR. The model also reduced NMDS stress to 0.094 and improved the Sørensen coefficient to 0.713, reflecting high robustness and precision in reconstructing community structure and ecological distributions. Additionally, the integration of distribution alignment and disturbance-aware mechanisms allows the model to capture dynamic vegetation–soil feedbacks across environmental gradients and disturbance regimes. This enables more accurate identification of regeneration potential, resilience thresholds, and restoration trajectories in degraded forests. Overall, the framework provides a novel theoretical foundation and a data-driven pathway for applying artificial intelligence to forest ecosystem monitoring, degradation diagnosis, and adaptive management for sustainable recovery.

Keywords: forest ecosystem modeling; cross-hierarchical ecological alignment; AI-driven forest restoration assessment; ecological semantic encoder; ecological structure recovery

1. Introduction

Ecosystem renewal and restoration represent a core topic in contemporary ecological research, where aboveground vegetation and the soil seed bank play critical roles [1]. The aboveground vegetation reflects the current composition and structure of the community, while the soil seed bank carries the potential for future community regeneration and succession [2]. The relationship between them is a key determinant of ecosystem stability and resilience and directly influences the provision of ecosystem services such as carbon sequestration, water conservation, soil retention, and biodiversity maintenance [3]. However, extensive empirical studies have frequently demonstrated mismatches and incomplete alignments between the species composition of aboveground vegetation and soil seed banks, influenced by factors including temporal delays, environmental gradients, and histories of anthropogenic disturbance [4]. Scientifically characterizing and modeling these cross-level ecological couplings remain a pressing research challenge [5]. Meanwhile, global land use change continues to intensify its impacts on ecosystem structure and function [6]. Processes such as degradation in agro-pastoral ecotones, overgrazing, urban expansion, and forest-to-farmland conversion not only alter aboveground vegetation communities but also strongly affect the reserves and germination capacity of soil seed banks [7]. Land use and land cover change is considered a primary source of spatial heterogeneity in ecosystem functioning, often leading to weakened or disrupted vegetation–seed bank relationships [8]. This imbalance, in turn, results in instability and degradation in the provision of ecosystem services [9]. 
Consequently, adopting a cross-level alignment perspective that integrates land use patterns with ecosystem services is of great significance, as it facilitates the identification of underlying mechanistic processes and provides scientific guidance for restoring degraded ecosystems and informing regional management strategies.

In the ecological domain, although some studies have applied deep neural networks to model remote sensing imagery, species distributions, and environmental factors, the introduction of contrastive learning into vegetation–seed bank relationship modeling remains rare [10]. Particularly in the semantic representation [11] of species composition, integration of ecological functional traits, and dynamic characterization of disturbance factors, deep learning offers advantages unmatched by traditional methods. It can capture complex nonlinear relationships in high-dimensional spaces and reveal latent ecological semantics through distributed representations, thus providing new pathways for understanding and predicting ecosystem regeneration and succession [12]. Tang et al. [13] conducted a comparative study applying various machine learning algorithms (e.g., random forest, support vector machines) to predict soil seed bank persistence using environmental variables and seed traits. Their results demonstrated that machine learning significantly outperformed traditional linear models, verifying the advantage of modern algorithms in ecological prediction tasks. Rosbakh et al. [14] employed a random forest approach combining environmental variables, seed traits, and phylogenetic information to predict soil seed bank persistence, achieving much higher predictive accuracy than conventional models, even with only a few easily accessible variables. Khan et al. [15] proposed a forest health classification method combining principal component analysis with supervised learning models (e.g., random forest, support vector machines, decision trees). Using nine ecological indicator variables across 37 sites in the Western Himalayas, their study reported an average accuracy of 90.3% with random forest, significantly outperforming other models. 
Key drivers identified included diameter at breast height, tree height, and regeneration rates, demonstrating the interpretability and practicality of data-driven methods in ecological health assessment. Luo et al. [16] introduced the FREE framework, which transforms structured environmental features into natural language descriptions and leverages large language models for semantic recognition training. This framework achieved superior performance in stream temperature prediction and maize yield prediction tasks, while also demonstrating high efficiency and generalizability in data-scarce scenarios. Plohák et al. [17] combined extraction and cultivation approaches to analyze soil seed banks, capturing more species than either method alone and achieving stronger similarity with aboveground vegetation.

Nevertheless, these approaches present several limitations. First, they rely heavily on species presence and abundance statistics, lacking the capacity for deeper mechanistic modeling and failing to explain why and how mismatches occur between vegetation and seed banks [18]. Second, they are highly sensitive to noise and sparsity, while ecological field surveys often face limited sample sizes and high proportions of rare species, reducing robustness [19]. Third, these traditional approaches are largely static comparisons, unable to effectively capture the roles of disturbance histories, temporal delays, and environmental gradients [20]. Thus, although such methods provide preliminary insights into vegetation–seed bank relationships, significant bottlenecks remain in their scalability and explanatory capacity. With the rapid advancement of artificial intelligence and big data technologies, deep learning has introduced new possibilities for complex ecosystem modeling [21]. In recent years, techniques such as contrastive learning and multimodal representation learning have achieved breakthroughs in fields such as computer vision and natural language processing. Tasks such as image–text and speech–semantic alignment have demonstrated strong capabilities in semantic representation and cross-modal matching [22]. These approaches construct unified embedding spaces where heterogeneous modalities can be aligned and compared, making them particularly suitable for the naturally multimodal vegetation–seed bank pair in ecology [23].

Against this background, we propose a cross-level ecological alignment framework integrating land use and ecosystem services to overcome the limitations of existing multimodal ecological models. The major contributions are as follows:

Cross-level contrastive learning: We introduce a hierarchical contrastive learning paradigm that unifies aboveground vegetation and soil seed bank representations within a shared embedding space, explicitly addressing their asymmetric ecological dependencies beyond traditional linear (RDA, CCA) or symmetric contrastive models (SimCLR).
Multi-channel ecological semantic encoder: A Transformer-based encoder integrates heterogeneous ecological features—species composition, functional traits, and environmental variables—into interpretable multi-scale embeddings, enabling cross-modal semantic fusion.
Distribution-aligned contrastive loss: We combine InfoNCE with Maximum Mean Discrepancy (MMD) regularization to jointly optimize instance-level discrimination and distribution-level alignment. Although similar strategies exist in domain adaptation and cross-modal learning, this study is the first to extend them to ecological multimodal asymmetry, improving robustness under heterogeneous environmental conditions.
Disturbance-aware attention: Unlike conditional or bias-gated attention used in environmental prediction, our mechanism dynamically reweights pairwise ecological alignments based on learned disturbance embeddings (e.g., grazing, burning, soil disturbance), allowing context-sensitive yet stable semantic coupling under variable disturbances.
Mechanism–pattern–function integration: The framework incorporates land use/land cover (LULC) and ecosystem service indicators, linking ecological mechanisms and spatial patterns with functional outcomes to support large-scale ecosystem assessment and restoration.

2. Related Work

2.1. Modeling the Ecological Relationship Between Aboveground Vegetation and Seed Banks

The ecological correspondence between aboveground vegetation and soil seed banks serves as a crucial window for understanding the coupling mechanisms of ecosystem structure and function [24]. In studies of ecological restoration, community succession, and degradation assessment, this relationship has been widely employed to reveal transitivity, lag effects, and asymmetry between aboveground and belowground ecological compositions [25]. However, existing modeling approaches remain relatively basic, lacking the ability to capture deep semantic structures and nonlinear ecological response mechanisms, which limits their capacity to support fine-grained alignment and prediction tasks in complex ecosystem analysis [26]. In early research, ecologists primarily employed similarity-based measures to quantify the compositional correspondence between aboveground vegetation and soil seed banks [17]. Among these, the most widely used are the Jaccard index and Sørensen similarity coefficient, both derived from binary species–seed presence matrices to calculate the proportion of shared species between communities [27]. Although intuitive, these metrics only measure overlap in species occurrence, ignoring the asymmetric structural differences in abundance, dispersal capacity, and functional traits between vegetation and seed banks [28]. As a result, they fail to reflect the delayed regeneration potential and directional ecological processes (e.g., dormancy and germination filtering) that characterize the aboveground–belowground interface. To incorporate environmental gradients and abundance variation, subsequent studies have adopted redundancy analysis (RDA) and related ordination methods [29]. While RDA can reveal statistical correlations between community composition and environmental factors, it remains inherently linear and assumes symmetric relationships among variables. 
This assumption breaks down in the context of ecological asymmetry, where vegetation dynamics are shaped by aboveground competition and disturbance regimes, while seed banks are governed by belowground storage, dormancy, and stochastic recruitment processes. Consequently, linear ordination methods fail to model the nonlinear, hierarchical, and distributionally imbalanced interactions that govern the coupling between vegetation and seed bank communities.
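
The two similarity measures discussed above can be made concrete with a short sketch. It assumes the standard definitions (Jaccard as shared species over the union, Sørensen as twice the shared species over the summed richness); the function names and the five-species site are hypothetical and are not taken from the study's data.

```python
import numpy as np

def jaccard_index(veg, seed):
    """Shared species divided by species present in either community."""
    veg, seed = np.asarray(veg) > 0, np.asarray(seed) > 0
    union = np.logical_or(veg, seed).sum()
    return float(np.logical_and(veg, seed).sum() / union) if union else 0.0

def sorensen_coefficient(veg, seed):
    """Twice the shared species divided by the summed richness of both communities."""
    veg, seed = np.asarray(veg) > 0, np.asarray(seed) > 0
    total = veg.sum() + seed.sum()
    return float(2 * np.logical_and(veg, seed).sum() / total) if total else 0.0

# Hypothetical presence/absence records for five species at one site.
vegetation = [1, 1, 1, 0, 0]
seed_bank = [0, 1, 1, 1, 0]

print(jaccard_index(vegetation, seed_bank))         # 2 shared / 4 in the union
print(sorensen_coefficient(vegetation, seed_bank))  # 2 * 2 / (3 + 3)
```

Both functions return the same value when the two arguments are swapped, which illustrates why such indices cannot express the directional, asymmetric relationship between vegetation and seed bank described in the text.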

In contrast, the proposed approach moves beyond pairwise similarity and linear ordination by employing a multimodal Transformer-based semantic encoder and distribution-aligned contrastive loss. This framework captures both local semantic correspondence (species- and trait-level) and global distributional coherence (community- and ecosystem-level) under conditions of ecological asymmetry. By modeling the probabilistic divergence between aboveground and belowground representations, the method effectively bridges the semantic and structural gaps that conventional similarity indices and RDA-based approaches cannot address, providing a more realistic and interpretable view of cross-hierarchical ecosystem coupling.

2.2. Contrastive Learning and Ecological Multimodal Alignment

With the increasing dimensionality of ecological data and the growing complexity of data sources, traditional statistical approaches face bottlenecks such as fragmented ecological information, weak modal correspondence, and limited generalization capacity [30]. Consequently, ecological modeling has increasingly incorporated machine learning, particularly representation learning, to capture latent semantic associations across ecological modalities [13]. Against this backdrop, contrastive learning—one of the core paradigms of unsupervised representation learning—has emerged as a powerful framework for modeling aboveground–belowground ecological modal alignments [31]. The fundamental principle of contrastive learning is as follows: given a set of positive pairs (e.g., aboveground vegetation and soil seed bank feature vectors from the same ecological site) and multiple negative pairs (e.g., samples from different sites or environmental conditions), neural networks are trained to embed positive pairs closer in semantic space while pushing negative pairs apart [32]. The widely adopted SimCLR framework [33] operationalizes this process through the InfoNCE loss, optimizing instance-level similarity based on data augmentation and temperature-scaled softmax contrast. This design effectively improves discriminability and clustering consistency in balanced and symmetric data domains, such as vision and language. However, when directly applied to ecological multimodal data, SimCLR and similar symmetric contrastive paradigms exhibit significant limitations. They implicitly assume comparable sample distributions and equal semantic densities between paired modalities. In real-world ecosystems, aboveground vegetation and belowground seed banks differ markedly in feature sparsity, compositional diversity, and functional semantics—resulting in asymmetric representation distributions and non-stationary correlations across ecological gradients [34]. 
Consequently, a SimCLR-style symmetric objective may overemphasize local feature proximity while neglecting cross-modal distribution alignment, leading to unstable embedding convergence and reduced ecological interpretability.
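
The positive/negative pairing principle described above can be sketched as a minimal one-directional InfoNCE loss. This is a simplified illustration, not the SimCLR implementation: it assumes that row i of each embedding matrix comes from the same site, uses cosine similarity, and picks a temperature of 0.1 arbitrarily. The function name and the random embeddings are hypothetical.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """One-directional InfoNCE; row i of z_a and row i of z_b share a site."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                   # cosine similarity / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # positives on the diagonal

rng = np.random.default_rng(0)
veg_emb = rng.normal(size=(8, 16))                    # hypothetical vegetation embeddings
seed_emb = veg_emb + 0.01 * rng.normal(size=(8, 16))  # nearly identical paired embeddings

loss_paired = info_nce(veg_emb, seed_emb)
loss_mismatched = info_nce(veg_emb, seed_emb[::-1])   # site pairing deliberately broken
print(loss_paired, loss_mismatched)
```

The loss is low when same-site embeddings are the most similar pair in each row and rises sharply when the pairing is broken. Note that the objective only compares individual pairs; nothing in it constrains the overall distributions of the two modalities, which is the gap the following paragraph addresses.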

To address these limitations, recent studies have introduced distribution alignment terms—for example, minimizing divergences such as KL or JS divergence between the two modal representation distributions—to enforce statistical consistency and improve robustness under ecological asymmetry [35]. Building upon this foundation, the proposed framework integrates distribution-aligned contrastive learning with a disturbance-aware attention mechanism to jointly optimize semantic correspondence and distributional coherence. This design enables the model to capture both local semantic coupling (via contrastive learning) and global distribution alignment (via probabilistic regularization), thereby effectively mitigating the intrinsic asymmetry between vegetation and seed bank representations and advancing ecological semantic modeling toward greater stability and interpretability.

2.3. Land Use and Ecosystem Services Research

In addition to biological mechanisms and disturbances, changes in LULC also play a significant role in shaping vegetation–seed bank relationships [36]. Land use directly influences community composition and propagule availability, thereby affecting ecosystem service provision. Different LULC types not only alter the quantity and quality of ecosystem services but also affect alignment between aboveground vegetation and soil seed banks, thereby influencing ecological restoration potential [37]. For example, conversion of grassland to cropland may lead to loss of functional species in seed banks, reducing future recovery capacity, whereas forest restoration can foster species accumulation, enhancing biodiversity maintenance services. Integrating land use patterns with ecosystem service assessments thus provides insights into the practical significance of cross-level alignments.

3. Materials and Method

3.1. Data Collection

In this study, data collection was conducted in several representative grassland ecosystems, including alpine meadows on the Qinghai–Tibet Plateau and desert steppes in western Inner Mongolia, as shown in Table 1. These regions exhibit substantial differences in climate conditions, vegetation composition, and histories of anthropogenic disturbance, providing a broad ecological gradient for cross-level alignment modeling. Aboveground vegetation survey data were primarily obtained through quadrat sampling. Field investigations were carried out during the growing seasons of 2022 and 2023 (June to August). Standard quadrats of 1 m × 1 m were randomly established in each study site, and all plant species and their individual counts were recorded. Species abundance was estimated by combining visual cover assessments with individual counts. Soil seed bank samples were simultaneously collected during the same period. To minimize spatial bias, soil samples were extracted from the center and four corners of each vegetation quadrat at a depth of 0–10 cm. Five subsamples from each plot were pooled to form a representative sample. Germination experiments for the seed bank were conducted under controlled laboratory conditions, using natural light and a constant temperature regime (25 °C during the day and 15 °C at night). Species emergence and germination counts were monitored continuously for a period of 90 days, ensuring that delayed germination across species was fully captured. Environmental variable data were acquired from multiple sources. Topographic factors, such as elevation and slope, were obtained through portable GPS measurements and digital elevation model (DEM) interpretation. Soil nutrients, including total nitrogen, organic matter, available phosphorus, and available potassium, were determined by laboratory chemical analysis. Additional climatic variables were supplemented using daily observation records from the China Meteorological Data Sharing Service. 
Disturbance history information was derived through a combination of field interviews and remote sensing interpretation. Local herders and management agencies provided records of grazing intensity, fire events, and mechanical disturbances, while multi-temporal Landsat and Sentinel-2 imagery was employed to extract land-use change trajectories, enabling quantification of disturbance types and frequencies over the past decade. Finally, all data were standardized and integrated into a vegetation–seed bank ecological paired dataset, comprising N aboveground–belowground sample pairs. Each pair contained species composition, abundance information, germination experiment results, environmental variables, and disturbance records, thereby establishing a robust foundation for deep modeling of cross-level alignment.

3.2. Data Enhancement

Prior to deep learning modeling, the preprocessing and augmentation of ecological data are fundamental steps to ensure effective model training, directly influencing model generalization capacity, alignment accuracy, and ecological interpretability. Considering the multi-source nature, sparsity, and structural complexity of ecological data, a series of standardization, encoding, and augmentation strategies were designed in the preprocessing stage. The aim was to construct a semantically consistent multimodal representation space, thereby enhancing the learning capacity and robustness of species composition alignment between aboveground vegetation and soil seed banks.

In the raw ecological survey data, community species composition is generally represented as a species–abundance matrix. Since differences in species abundance across sites may be affected by sampling area, germination rate, and biomass, the direct use of raw abundance data may cause gradient shifts during model training. Therefore, normalization of species abundance at each site is required. Given $n$ species at one site, with abundance vector $x = [x_{1}, x_{2}, \dots, x_{n}]$, the normalized abundance representation $\hat{x}$ is calculated as follows:

$$\hat{x}_{i} = \frac{x_{i}}{\sum_{j=1}^{n} x_{j}}. \tag{1}$$

This procedure maps species abundances at each site into a probability distribution, ensuring comparability across sites and preventing total abundance differences from interfering with model learning. Furthermore, many rare species occur in only a few sites within aboveground or belowground communities. These rare species may act as noise signals in alignment modeling, potentially causing “semantic sparsity due to sample sparsity” in deep learning. To address this, a rare species handling mechanism was introduced. A frequency threshold $\theta$ was defined, and the occurrence frequency $f_{i}$ of species $i$ across all sites was evaluated as:

$$f_{i} = \frac{1}{N} \sum_{j=1}^{N} \mathbb{I}(x_{ij} > 0), \tag{2}$$

where $N$ denotes the total number of sites, $x_{ij}$ indicates the abundance of species $i$ in site $j$, and $\mathbb{I}(\cdot)$ is the indicator function. When $f_{i} < \theta$, species $i$ is considered rare and is either removed or aggregated depending on subsequent task requirements. In this study, rare species with similar functional traits were aggregated into a “functional group” to avoid the risk of eliminating important ecological functions.
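
As an illustration of Equations (1) and (2), the sketch below normalizes a small abundance matrix, computes occurrence frequencies, and pools rare species by functional group. It is a minimal sketch under stated assumptions: the matrix is stored as sites × species (the transpose of the $x_{ij}$ indexing in the text), and the counts, group labels, threshold of 0.3, and function names are all hypothetical.

```python
import numpy as np

def normalize_abundance(counts):
    """Eq. (1): divide each site's abundances by the site total (rows = sites)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

def occurrence_frequency(counts):
    """Eq. (2): fraction of sites in which each species has nonzero abundance."""
    return (np.asarray(counts) > 0).mean(axis=0)

def aggregate_rare(counts, groups, theta):
    """Sum rare species (f_i < theta) into one column per functional group."""
    counts = np.asarray(counts, dtype=float)
    rare = occurrence_frequency(counts) < theta
    kept = counts[:, ~rare]
    labels = sorted({g for g, r in zip(groups, rare) if r})
    pooled = [counts[:, rare & (np.array(groups) == g)].sum(axis=1) for g in labels]
    return np.column_stack([kept] + pooled) if pooled else kept

# Hypothetical counts: 4 sites (rows) x 4 species (columns).
counts = [[4, 0, 1, 0],
          [2, 2, 0, 0],
          [6, 0, 0, 1],
          [3, 1, 0, 0]]
groups = ["grass", "forb", "annual", "annual"]  # hypothetical functional groups

print(occurrence_frequency(counts))             # per-species f_i
pooled = aggregate_rare(counts, groups, theta=0.3)
print(normalize_abundance(pooled))              # each row sums to 1
```

Because pooling only sums columns, site totals are unchanged, so normalizing before or after aggregation gives the same per-site distribution over the retained columns.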

For multimodal feature construction, both aboveground and belowground samples were encoded as vectors $v_{a}$ and $v_{b}$, respectively. Their feature dimensions incorporated species composition distributions, community functional trait indices (e.g., SLA, seed size, life form), and community structural parameters (e.g., Shannon diversity index, Pielou’s evenness). For instance, the Shannon index $H$ is defined as:

$$H = -\sum_{i=1}^{n} \hat{x}_{i} \log \hat{x}_{i}. \tag{3}$$

This index reflects the overall information entropy of species richness and abundance distribution within a community. In feature encoding, all ecological indicators were normalized and concatenated to construct a multimodal embedding space, expressed as:

$$v_{a} = [\hat{x}_{a} \,|\, f_{a} \,|\, s_{a}], \quad v_{b} = [\hat{x}_{b} \,|\, f_{b} \,|\, s_{b}], \tag{4}$$

where $f_{a}$ and $f_{b}$ denote functional trait vectors, $s_{a}$ and $s_{b}$ represent community structural indices, and $|$ indicates concatenation.
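
Equations (3) and (4) can be sketched as follows. The natural logarithm is assumed because the text does not state a base, Pielou's evenness is taken in its usual form as $H$ divided by the log of observed richness, and per-feature scaling of the concatenated blocks is omitted for brevity. The trait values and function names are hypothetical.

```python
import numpy as np

def shannon_index(p):
    """Eq. (3) with natural log; zero-abundance species contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def pielou_evenness(p):
    """Shannon index divided by its maximum, the log of observed richness."""
    richness = int((np.asarray(p) > 0).sum())
    return shannon_index(p) / float(np.log(richness)) if richness > 1 else 0.0

def build_feature_vector(p, traits):
    """Eq. (4): concatenate composition, trait, and structural blocks."""
    structure = np.array([shannon_index(p), pielou_evenness(p)])
    return np.concatenate([np.asarray(p, dtype=float),
                           np.asarray(traits, dtype=float),
                           structure])

composition = np.array([0.25, 0.25, 0.25, 0.25])  # normalized abundances, Eq. (1)
traits = np.array([0.4, 0.7])                     # hypothetical scaled trait indices
v_a = build_feature_vector(composition, traits)
print(v_a.shape, shannon_index(composition))      # H equals ln(4) for four equal species
```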

In terms of data augmentation, given the limited number of ecological samples and the sensitivity of contrastive learning frameworks to the diversity of negative samples during early training, two augmentation strategies were designed: random sample mixing (MixSample) and perturbation-based expansion (Perturb-Expand). The first strategy, inspired by MixUp, generates intermediate ecological communities by linearly combining the features of two samples, thereby increasing training diversity. Given two input vectors $v_{1}$ and $v_{2}$, the mixed sample $v_{mix}$ is expressed as:

$$v_{mix} = \lambda v_{1} + (1 - \lambda) v_{2}, \quad \lambda \sim \mathrm{Beta}(\alpha, \alpha), \tag{5}$$

where $\lambda$ is the mixing weight sampled from a Beta distribution, controlling the combination ratio of the two samples, and $\alpha$ is a hyperparameter that regulates the mixing degree. This approach simulates continuous transitions between ecological states, enabling the model to learn more flexible ecological embeddings.
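
A minimal sketch of the MixSample step in Equation (5) is given below. The value $\alpha = 0.4$, the function name, and the two example compositions are assumptions made for illustration; the text does not report the value of $\alpha$ used.

```python
import numpy as np

def mix_sample(v1, v2, alpha=0.4, rng=None):
    """Eq. (5): convex combination of two samples with lambda ~ Beta(alpha, alpha)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    mixed = lam * np.asarray(v1, dtype=float) + (1.0 - lam) * np.asarray(v2, dtype=float)
    return mixed, lam

rng = np.random.default_rng(42)
site_1 = np.array([0.7, 0.2, 0.1, 0.0])  # hypothetical normalized compositions
site_2 = np.array([0.1, 0.1, 0.3, 0.5])
mixed, lam = mix_sample(site_1, site_2, alpha=0.4, rng=rng)
print(lam, mixed, mixed.sum())
```

Because the mixture is a convex combination, two normalized abundance vectors yield a result that is still a valid distribution, so the mixed sample remains compatible with Equation (1).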

For perturbation-based augmentation, slight perturbations were introduced to environmental variables and functional traits, simulating micro-ecological shifts caused by small-scale disturbances or natural variability. Given a feature vector $v$, the perturbed sample $\tilde{v}$ is defined as:

$$\tilde{v} = v + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^{2} I), \tag{6}$$

where $\epsilon$ is a noise term drawn from a zero-mean Gaussian distribution, and $\sigma$ controls perturbation intensity. This method enhances model robustness to local perturbations, simulates community structural shifts under varying disturbance intensities, and facilitates the learning of implicit correspondences during ecological alignment. Beyond the aforementioned augmentation strategies, recent research has highlighted the potential of generative alignment techniques to further enhance multimodal stability. Given the variability at both spectral and structural levels across vegetation–soil modalities, penalty-based generative frameworks such as modified Wasserstein GANs with gradient penalty have demonstrated substantial gains in spectral fidelity and distributional smoothness across heterogeneous domains (e.g., RGB and infrared) [38]. This perspective underscores the value of incorporating penalty-based loss constraints for improving cross-domain representation consistency. While our current framework employs contrastive and perturbation-based augmentation, future extensions may integrate generative alignment regularization to further enhance ecological embedding stability under large spectral or structural shifts.
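
The Gaussian perturbation of Equation (6) can be sketched as below. Since the text applies the noise to environmental variables and functional traits, the sketch adds an optional mask that leaves other entries (for example, the composition block) untouched. The mask argument, the value $\sigma = 0.05$, and the example vector are assumptions for illustration only.

```python
import numpy as np

def perturb_expand(v, sigma=0.05, rng=None, mask=None):
    """Eq. (6): add zero-mean Gaussian noise, optionally only where mask == 1."""
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(v, dtype=float)
    noise = rng.normal(0.0, sigma, size=v.shape)
    if mask is not None:
        noise = noise * np.asarray(mask, dtype=float)  # zero out protected entries
    return v + noise

rng = np.random.default_rng(7)
v = np.array([0.6, 0.4, 0.35, 0.80, 0.12])  # hypothetical: 2 composition + 3 trait/env entries
trait_env_mask = np.array([0, 0, 1, 1, 1])  # perturb only the trait/env entries
v_tilde = perturb_expand(v, sigma=0.05, rng=rng, mask=trait_env_mask)
print(v_tilde)
```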

3.3. Proposed Method

3.3.1. Overall

The proposed methodological framework begins with the input of preprocessed multimodal ecological data and proceeds through a series of interconnected modules to achieve cross-hierarchy alignment modeling. As illustrated in Figure 1, the standardized and feature-constructed vectors of aboveground vegetation and soil seed banks are first fed into the ecological semantic encoder. This encoder, built upon a multi-channel Transformer architecture, performs deep semantic representation learning across modalities. Through a multi-head attention mechanism, it captures latent associations among species composition, functional traits, and community structural indicators, ultimately producing comparable semantic embeddings in a shared high-dimensional space. These embeddings are then passed to the distribution alignment module, which aims to align the representation distributions between modalities. At this stage, the model simultaneously measures sample-level semantic proximity and enforces distribution-level consistency, ensuring that the overall representation space remains stable and coherent under cross-modality conditions. Consequently, ecological features from above- and belowground components are effectively matched within a unified semantic domain. Next, the disturbance-aware attention mechanism dynamically regulates the weighting of sample pairs during contrastive learning. By incorporating external environmental variables and disturbance records, this mechanism adaptively adjusts attention weights according to disturbance intensity and type. In doing so, it emphasizes ecologically meaningful sample pairs that contribute to robust alignment, while suppressing the influence of noisy or severely disturbed samples—thereby enhancing model resilience in complex ecological scenarios. Finally, all encoded, aligned, and disturbance-modulated representations enter the contrastive optimization stage. 
Here, the model refines semantic alignment by minimizing the relative distances between positive sample pairs and maximizing those between negative pairs in the embedding space. This process encourages convergence of ecologically similar communities while maintaining sufficient separation among dissimilar ones, leading to an embedding structure that balances discriminability and generalizability. Overall, the entire pipeline operates as an integrated system: the encoder provides semantic representation, the distribution alignment ensures cross-modality coherence, the attention mechanism introduces adaptive disturbance modulation, and the contrastive optimization completes semantic refinement. Through this synergistic integration, the framework achieves deep ecological semantic alignment between aboveground vegetation and soil seed banks, establishing a solid foundation for subsequent analyses of land-use patterns and ecosystem service interactions. Detailed mathematical derivations can be found in Appendix A.

3.3.2. Ecological Semantic Encoder

The ecological semantic encoder is designed to effectively map multimodal features of aboveground vegetation and soil seed banks into a unified high-dimensional semantic space, thereby enabling deep alignment across hierarchical levels. As shown in Figure 2, the encoder is constructed on an improved multi-channel Transformer structure. Its inputs are standardized and feature-constructed vector representations, denoted as the aboveground vegetation feature vector $v_{a}$ and the soil seed bank feature vector $v_{b}$. Structurally, the encoder first projects the input vectors into a $d$-dimensional embedding space through linear transformation layers to ensure dimensional consistency across different feature sources. A multi-head self-attention mechanism with rotary positional encoding (RoPE) is then employed to model dependencies across species and feature dimensions, formalized as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$, where $Q$, $K$, and $V$ are linear projections of the inputs and $d_{k}$ is the key dimension, whose square root serves as the scaling factor. By performing parallel computation in multiple heads, the model captures multi-level semantic associations among community composition, functional traits, and structural features.
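
The attention formula above can be sketched for a single head as follows. This is a minimal illustration of scaled dot-product attention only: the multi-head split, RoPE, and the learned projection layers of the encoder are omitted, and the random projection matrices and token count are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k))  # (tokens, tokens), rows sum to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # hypothetical: 5 feature tokens, 8 dimensions each
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))  # stand-in projections
context, weights = scaled_dot_product_attention(tokens @ w_q, tokens @ w_k, tokens @ w_v)
print(context.shape, weights.sum(axis=1))
```

Dividing by $\sqrt{d_{k}}$ keeps the logits at a comparable scale as the key dimension grows, so the softmax does not saturate on a single token.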

After attention computation, the encoder incorporates a convolution-enhanced module with adaptive layer normalization (adaLN), which applies local convolutional operations (ConvMLP) to strengthen neighborhood structural features in the embedding. This ensures that the encoded representations capture not only global dependencies but also local community patterns. Specifically, after each attention layer, the convolution-enhanced module performs residual connection and normalization, expressed as

h_{l + 1} = adaLN (h_{l} + ConvMLP (h_{l}))

, where h_{l} is the output of the l-th layer. This residual design stabilizes deep network training and enhances the model’s ability to detect local patterns in complex ecological data.

In terms of parameter configuration, the encoder consists of L stacked Transformer blocks, each containing h attention heads. The embedding dimension is set to d = 256, the feed-forward network is expanded to 4 d, and the convolution kernel size is 3 \times 3, ensuring that local features are captured without losing multi-scale community structure. Finally, the embedding vectors output by the encoder are aggregated via average pooling to obtain sample-level global representations, denoted as z_{a} and z_{b} for aboveground vegetation and soil seed bank samples, respectively.

The advantages of this design in the present task are threefold. First, the multi-head attention mechanism captures nonlinear and cross-modal complex relationships in high-dimensional space, revealing deep correspondences between aboveground and belowground communities in terms of functional traits and species composition. Second, the convolutional enhancement module provides the encoder with the ability to model local community structures, thereby maintaining robustness against sample sparsity and abundance heterogeneity. Third, the adaptive normalization strategy enhances the model’s generalization performance under varying ecological gradients, effectively mitigating distribution shifts caused by differences across regions and disturbance histories. Through this encoder, aboveground vegetation and soil seed banks are aligned semantically within a unified embedding space, thus providing high-quality inputs for subsequent modules such as distribution alignment and disturbance-aware modeling, establishing the core foundation for cross-hierarchy ecological modeling.

3.3.3. Distribution-Aligned Contrastive Loss

The design of the distribution-aligned contrastive loss originates from the need to overcome the limitations of conventional contrastive learning loss functions, which rely solely on constraints imposed by individual positive and negative sample pairs. Instead, the focus is extended to the distribution level to address the asymmetry and distributional shifts frequently observed between aboveground vegetation and soil seed banks. In traditional loss functions such as InfoNCE or Triplet Loss, the optimization objective is to minimize the distance between positive sample pairs while maximizing the separation from negative sample pairs, with the formulation based exclusively on pairwise similarity. However, in ecological semantic alignment tasks, the semantic spaces of aboveground communities and soil seed banks differ not only at the level of individual samples but also in their overall distributional structures, often resulting in mismatches and inconsistencies. Consequently, the exclusive use of pairwise similarity may cause the learned representations to neglect population-level statistical regularities, thereby impeding robust cross-hierarchy alignment.

As shown in Figure 3, to address this limitation, the distribution-aligned contrastive loss introduces an additional distribution-level constraint into the conventional contrastive learning framework. This ensures that during optimization, embeddings from different modalities are aligned not only at the sample level but also at the global distribution level. Formally, let the embedding distribution of aboveground vegetation be P_{a} and that of the soil seed bank be P_{b}. The distributional discrepancy between the two is defined using Maximum Mean Discrepancy (MMD), expressed as:

D_{MMD} (P_{a}, P_{b}) = {∥\frac{1}{n_{a}} \sum_{i = 1}^{n_{a}} ϕ (v_{a}^{i}) - \frac{1}{n_{b}} \sum_{j = 1}^{n_{b}} ϕ (v_{b}^{j})∥}^{2},

(7)

where ϕ (\cdot) is a kernel mapping function, n_{a} and n_{b} denote the number of samples in each modality, and v_{a}^{i} and v_{b}^{j} are the sample vector representations. This metric captures the discrepancy between two modalities in a high-dimensional feature space. On the other hand, to maintain local alignment accuracy, the standard contrastive loss is preserved. Denoting the similarity function between embeddings as sim (v_{a}^{i}, v_{b}^{j}), the local contrastive loss is defined as:

L_{local} = - \frac{1}{n} \sum_{i = 1}^{n} log \frac{exp (sim (v_{a}^{i}, v_{b}^{i}) / τ)}{\sum_{j = 1}^{n} exp (sim (v_{a}^{i}, v_{b}^{j}) / τ)} .

(8)

The final distribution-aligned contrastive loss integrates both sample-level and distribution-level constraints, given as:

L_{DA} = L_{local} + λ D_{MMD} (P_{a}, P_{b}),

(9)

where λ is a balancing coefficient that regulates the relative importance of local alignment and distribution alignment. It can be mathematically demonstrated that when λ > 0, the optimal solution of the loss function requires not only the minimization of distances between positive sample pairs but also the consistency of mean embeddings between the two modalities, thereby achieving distribution alignment at the population level. This design ensures that robust cross-modal representations can be learned even in contexts characterized by heterogeneity and asymmetry.
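Equations (7) to (9) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' code: the kernel mapping ϕ is taken to be the identity (so the MMD term reduces to the squared distance between mean embeddings), `sim` is taken to be cosine similarity, and the default `lam=0.5` is a placeholder because the value of λ is not reported in this section. The temperature `tau=0.07` follows the experimental settings.

```python
import math

def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def local_contrastive_loss(Za, Zb, tau=0.07):
    """Eq. (8): row i of Za is the positive partner of row i of Zb."""
    n = len(Za)
    total = 0.0
    for i in range(n):
        logits = [cosine(Za[i], Zb[j]) / tau for j in range(n)]
        m = max(logits)
        # log of the denominator, computed stably (log-sum-exp).
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total -= logits[i] - log_denominator
    return total / n

def mmd_linear(Za, Zb):
    """Eq. (7) with phi assumed to be the identity map."""
    dim = len(Za[0])
    mean_a = [sum(z[c] for z in Za) / len(Za) for c in range(dim)]
    mean_b = [sum(z[c] for z in Zb) / len(Zb) for c in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))

def distribution_aligned_loss(Za, Zb, lam=0.5, tau=0.07):
    """Eq. (9): sample-level term plus lam times the distribution-level term."""
    return local_contrastive_loss(Za, Zb, tau) + lam * mmd_linear(Za, Zb)
```

With the identity feature map, the MMD term vanishes exactly when the two modalities share the same mean embedding, which is the population-level condition described above; a nonlinear kernel would compare higher-order statistics as well.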

From an ecological perspective, minimizing the distributional discrepancy term, whether expressed as KL divergence or MMD distance, can be interpreted as enforcing community-level similarity between the ecological “signatures” of aboveground vegetation and soil seed banks. In this context, each embedding distribution represents the probabilistic composition and functional spectrum of a community. Therefore, reducing the divergence between P_{a} and P_{b} corresponds to minimizing the mismatch between the functional or compositional niches of aboveground and belowground communities. This process aligns with the ecological concept of community convergence, where structurally or functionally related assemblages exhibit similar trait distributions under shared environmental constraints. In other words, the distribution-aligned contrastive loss mathematically formalizes the ecological principle that stable vegetation–seed bank systems should maintain consistent functional diversity and compositional balance across ecosystem layers, even when local samples differ due to stochastic or disturbance-driven variation.

Applied to the present task, the advantages of distribution-aligned contrastive loss are twofold. First, within ecological paired datasets of aboveground vegetation and soil seed banks, species composition often exhibits sample sparsity and structural heterogeneity. The distribution alignment mechanism effectively mitigates embedding shifts induced by rare species or extreme samples, thereby enabling the model to focus more on global regularities. Second, in cross-regional or multiple-land-use contexts, significant ecological gradients often lead to distributional divergences between aboveground and belowground communities. Conventional loss functions struggle to adapt to such distribution shifts, whereas the distribution-aligned contrastive loss enforces global distribution consistency, ensuring stable alignment performance when generalizing to new regions. By jointly considering both local and global perspectives, this approach not only enhances alignment accuracy but also strengthens the ecological interpretability of the model, providing solid mathematical support for uncovering semantic associations across hierarchical levels.

3.3.4. Disturbance-Aware Attention Module

The design of the disturbance-aware attention mechanism differs fundamentally from that of conventional self-attention, with its core innovation lying in the incorporation of external environmental and anthropogenic disturbance variables as priors for generating attention weights, rather than relying solely on the autocorrelation of the input sequence. In self-attention, the attention weight

α_{i j}

is calculated entirely based on the similarity of input vectors, typically formulated as

α_{i j} = \frac{exp (q_{i} \cdot k_{j} / \sqrt{d})}{\sum_{j} exp (q_{i} \cdot k_{j} / \sqrt{d})},

(10)

where q_{i} and k_{j} denote the query and key projections, respectively, and d is the dimension of the projections, so that \sqrt{d} serves as the scaling factor. In contrast, in the disturbance-aware attention mechanism, the weight allocation depends not only on the similarity of input vectors but is also modulated by a disturbance vector D \in R^{n}. Specifically, the disturbance vector is processed through two fully connected layers for nonlinear mapping, where the first layer has dimensions [n, 128] and the second layer has dimensions [128, h], with h consistent with the input feature dimension. This process generates disturbance-sensitive correction factors in the embedding space. The attention weight is then reformulated as

α_{i j} = \frac{exp ((q_{i} \cdot k_{j} + d_{i}) / \sqrt{d})}{\sum_{j} exp ((q_{i} \cdot k_{j} + d_{i}) / \sqrt{d})},

(11)

where d_{i} = W_{2} σ (W_{1} D_{i}), W_{1} \in R^{n \times 128}, W_{2} \in R^{128 \times h}, and σ represents a nonlinear activation function. This formulation demonstrates that disturbance factors directly participate in reconstructing the attention distribution, thereby enhancing the model’s ability to perceive external ecological disturbance signals.
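A minimal single-head sketch in plain Python may help make the modulation in Eq. (11) concrete. It is an illustration under stated assumptions, not the authors' implementation: the correction is reduced to one scalar per attended position j (Eq. (11) writes it as d_{i}; a term that is identical for every j would cancel in the softmax normalization, so the sketch indexes it by the key), the hidden width is 2 rather than 128, σ is taken to be ReLU, and all weights and disturbance values are made up.

```python
import math

def relu(x):
    return [max(0.0, xi) for xi in x]

def matvec(W, x):
    """Multiply a row-major matrix W ([out][in]) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def disturbance_bias(D, W1, w2):
    """Two-layer map from a disturbance vector to one scalar logit correction."""
    hidden = relu(matvec(W1, D))
    return sum(a * b for a, b in zip(w2, hidden))

def disturbance_attention_weights(q, keys, biases, scale):
    """Softmax over (q . k_j + bias_j) / scale for a single query q."""
    logits = []
    for k, bias_j in zip(keys, biases):
        dot = sum(a * b for a, b in zip(q, k))
        logits.append((dot + bias_j) / scale)
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setup: 2 attended positions and 3 disturbance variables
# (for instance grazing, fire, and land-use change). Values are hypothetical.
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
w2 = [0.5, 1.0]
D = [[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]]  # position 1 is the disturbed one
biases = [disturbance_bias(d, W1, w2) for d in D]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]  # identical keys, so only the bias differs
plain = disturbance_attention_weights(q, keys, [0.0, 0.0], math.sqrt(2))
modulated = disturbance_attention_weights(q, keys, biases, math.sqrt(2))
```

Because the two keys are identical, `plain` splits the weight evenly, while `modulated` shifts weight toward the disturbed position. Passing the same bias for every position returns the unmodulated weights, which is why the sketch lets the correction vary across attended positions.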

As shown in Figure 4, the disturbance-aware attention mechanism adopts a four-layer stacked multi-head attention module, with each layer consisting of eight parallel attention heads, each of dimension 64, producing an overall output dimension of 512. The input feature tensor has shape [32, 32, 512]. The first layer processes a two-dimensional feature map of size 32 \times 32, which is embedded through convolution before being passed to the attention layer. The second layer downsamples the representation to [16, 16, 512] to capture local spatial context, while the third layer reduces the resolution to [8, 8, 512] to strengthen global dependency modeling. The fourth layer restores the dimension to [16, 16, 512] and fuses it with the initial features to ensure the integration of fine-grained and global information. This hierarchical design allows disturbance factors to exert modulation effects across multiple spatial scales, while residual connections and normalization operations mitigate vanishing gradients. Mathematically, it can be shown that this mechanism is equivalent to adding a bias term to the traditional attention distribution, such that the attention matrix ceases to satisfy symmetry under disturbance conditions. This asymmetry aligns with the ecological responses of aboveground vegetation and soil seed banks, which are typically asymmetric under external disturbances.

When used jointly with the ecological semantic encoder, the disturbance-aware attention mechanism provides disturbance-sensitive attention distributions, while the semantic embeddings output by the encoder are further updated through disturbance modulation. Let h_{i} denote the vector produced by the encoder. The disturbance-modulated representation is then expressed as

{\tilde{h}}_{i} = \sum_{j} α_{i j} v_{j},

(12)

where v_{j} represents the value feature vectors. Thus, the disturbance-aware attention module is not an independent entity but functions as a regulator of the semantic alignment process, ensuring that the model can still learn robust cross-hierarchical alignments under varying disturbance intensities. This joint design enables the model to demonstrate higher robustness and interpretability when confronted with non-explicit correspondences and distributional imbalances, thereby providing strong support for cross-hierarchical ecological semantic alignment.

4. Results and Discussion

4.1. Evaluation Metrics

Multiple quantitative metrics were employed to comprehensively evaluate the alignment performance, distributional consistency, and ecological structure recovery capability of the proposed model. These include alignment accuracy metrics (Top-k matching accuracy and mean cosine similarity), distributional divergence metrics (KL divergence and Earth Mover’s Distance), and ecological community reconstruction metrics (Jaccard similarity, Sørensen coefficient, and NMDS stress residual). Furthermore, ablation experiments were conducted to analyze the contribution of each module to the overall performance improvement. The corresponding mathematical formulations are defined as follows:

Top - k Accuracy = \frac{1}{N} \sum_{i = 1}^{N} I (y_{i} \in Top - k ({\hat{y}}_{i})),

(13)

Mean Cosine Similarity = \frac{1}{N} \sum_{i = 1}^{N} \frac{v_{a}^{i} \cdot v_{b}^{i}}{∥v_{a}^{i}∥ ∥v_{b}^{i}∥},

(14)

D_{K L} (P ∥ Q) = \sum_{i = 1}^{n} P (i) log \frac{P (i)}{Q (i)},

(15)

EMD (P, Q) = \inf_{γ \in Γ (P, Q)} E_{(x, y) \sim γ} [∥x - y∥],

(16)

J (A, B) = \frac{| A \cap B |}{| A \cup B |}, S (A, B) = \frac{2 | A \cap B |}{| A | + | B |},

(17)

Stress_{NMDS} = \sqrt{\frac{\sum_{i < j} {(d_{i j} - δ_{i j})}^{2}}{\sum_{i < j} δ_{i j}^{2}}} .

(18)

In these formulations, N denotes the number of samples, I (\cdot) is the indicator function, and v_{a}^{i} and v_{b}^{i} represent the embedded vectors of aboveground vegetation and belowground seed bank for the i-th pair, respectively. The variable y_{i} indicates the true matching label, while {\hat{y}}_{i} denotes the predicted result. The terms P and Q correspond to the embedding distributions of the aboveground and belowground modalities, and γ denotes a transport plan, that is, a joint distribution with marginals P and Q. Sets A and B denote the species compositions of two ecological communities, whereas d_{i j} and δ_{i j} represent the ecological distances after NMDS dimensional reduction and in the original space, respectively. These metrics collectively capture both local alignment precision and global ecological consistency, thereby providing a rigorous and interpretable evaluation of cross-hierarchical semantic alignment performance.
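Several of these metrics are short enough to be written out directly. The sketch below, in plain Python, is only an illustration of Equations (13), (15), (17), and (18): the function names are hypothetical, the true partner of vegetation sample i is assumed to be seed-bank sample i, and the KL divergence is computed for discrete distributions (how the embedding distributions were discretized is not stated here).

```python
import math

def top_k_accuracy(similarity, k):
    """Eq. (13). similarity[i][j] scores vegetation sample i against
    seed-bank sample j; the true partner of row i is assumed to be column i."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

def kl_divergence(p, q):
    """Eq. (15) for discrete distributions. Terms with P(i) = 0 contribute 0;
    Q(i) must be positive wherever P(i) is positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jaccard(a, b):
    """Eq. (17), left: shared species over all species in either community."""
    return len(a & b) / len(a | b)

def sorensen(a, b):
    """Eq. (17), right: twice the shared species over the summed richness."""
    return 2 * len(a & b) / (len(a) + len(b))

def nmds_stress(d, delta):
    """Eq. (18). d and delta are flat lists of pairwise distances (i < j)
    in the ordination space and in the original space, respectively."""
    numerator = sum((x - y) ** 2 for x, y in zip(d, delta))
    denominator = sum(y ** 2 for y in delta)
    return math.sqrt(numerator / denominator)

# Hypothetical example: two communities sharing two of four species.
veg = {"Quercus", "Pinus", "Betula"}
seed = {"Pinus", "Betula", "Carex"}
j = jaccard(veg, seed)    # 2 shared / 4 total
s = sorensen(veg, seed)   # 2 * 2 / (3 + 3)
```

For a single pair of communities the two similarity indices are linked by S = 2J / (1 + J), so they rank pairs identically and differ only in scale.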

4.2. Experiment Settings

4.2.1. Hardware and Software Configuration

The experimental setup was implemented on a high-performance computing server equipped with substantial computational and storage capabilities for large-scale ecological data processing and deep model training. The hardware configuration consisted of dual Intel Xeon Gold 6338 CPUs, each featuring 32 physical cores, yielding a total of 64 cores and 128 threads, thereby providing sufficient computational resources for data preprocessing and parallel training. A total of 512 GB DDR4 memory was allocated to ensure efficient loading and batch processing of extensive ecological datasets. The storage subsystem utilized a 4 TB NVMe SSD to achieve high-speed read/write operations for training data and intermediate model parameters, optimizing I/O performance. The graphical computing units comprised four NVIDIA A100 GPUs (80 GB each), featuring third-generation Tensor Cores and multi-instance GPU (MIG) support, which substantially accelerated Transformer-based architectures and enhanced parallelism in contrastive learning tasks. The software environment was configured on Ubuntu Server 22.04 LTS, providing a stable and compatible Linux environment. Model development and execution were conducted in Python 3.10 using the PyTorch 2.1.0 deep learning framework, complemented by PyTorch Lightning 2.1.1 for modular architecture design and efficient training control. Ecological data preprocessing and statistical analyses were performed with pandas, numpy, scikit-learn, and scipy, while ecological similarity and community structure analyses were implemented using the vegan and scikit-bio toolkits. Visualization tasks were supported by matplotlib, seaborn, and plotly. The entire experimental environment was containerized in Docker to ensure software dependency consistency, and the training process was GPU-accelerated through CUDA 12.2 and cuDNN 8.9. 
Training logs and evaluation metrics were monitored and visualized in real time using TensorBoard and Weights & Biases, facilitating transparent experiment management and reproducibility.

4.2.2. Hyperparameter Settings

The dataset was divided following a standard train-validation-test split to ensure robust generalization and stable model performance. Specifically, the complete set of aboveground–belowground ecological paired samples was partitioned in a 7:1:2 ratio, with 70% of the samples used for model training, 10% for hyperparameter tuning and early stopping validation, and the remaining 20% reserved for independent testing to assess alignment performance and ecological structure reconstruction on unseen data. The principal hyperparameters were configured as follows. The initial learning rate was set to α = 1 \times 10^{- 4}, and model optimization was performed using the Adam optimizer with momentum parameters β_{1} = 0.9 and β_{2} = 0.999. A weight decay coefficient of 1 \times 10^{- 5} was applied to reduce overfitting. The temperature parameter in contrastive learning was fixed at τ = 0.07, controlling the sensitivity of the loss function to positive and negative sample similarity differences. Each training batch contained 64 samples, and the maximum number of training epochs was set to 200, with early stopping triggered after 15 consecutive epochs without improvement on the validation set. For MixSample augmentation, the mixing coefficient λ (distinct from the balancing coefficient of the distribution-aligned loss) was drawn from a Beta distribution Beta (α, α) with α = 0.5 (here α denotes the Beta shape parameter, not the learning rate), while for perturbation-based augmentation, the disturbance amplitude followed a normal distribution N (0, σ^{2}) with σ = 0.01. The entire training procedure was executed under fixed random seeds and reproducibility protocols to ensure consistent results. All hyperparameters were determined through grid search combined with validation performance evaluation to achieve an optimal balance among accuracy, stability, and ecological interpretability.
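The settings above can be collected into a small configuration sketch. The code below is plain Python and makes several assumptions that are not taken from the paper: the names `CONFIG`, `split_indices`, `EarlyStopping`, and `mix_coefficient` are hypothetical, the seed value is arbitrary, and early stopping is assumed to monitor a validation loss (the monitored quantity is not specified).

```python
import random

# Values as reported in the text; the key names are illustrative only.
CONFIG = {
    "learning_rate": 1e-4,
    "adam_betas": (0.9, 0.999),
    "weight_decay": 1e-5,
    "temperature": 0.07,
    "batch_size": 64,
    "max_epochs": 200,
    "patience": 15,
    "mix_alpha": 0.5,
    "perturbation_sigma": 0.01,
}

def split_indices(n, seed=0):
    """Shuffle 0..n-1 with a fixed seed and cut it 7:1:2 (train/val/test)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * 7 // 10
    n_val = n // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=CONFIG["patience"]):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def mix_coefficient(rng, alpha=CONFIG["mix_alpha"]):
    """Draw a mixing coefficient from Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)

def perturb(features, rng, sigma=CONFIG["perturbation_sigma"]):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [x + rng.gauss(0.0, sigma) for x in features]
```

With α = 0.5 the Beta density is U-shaped, so most draws fall near 0 or 1 and a mixed sample stays close to one of its two parents.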

4.2.3. Baseline Methods

The baseline methods adopted in this study included three non-learning approaches, namely RDA [39], canonical correspondence analysis (CCA) [40], and Procrustes analysis [41], as well as five deep learning models: the multilayer perceptron matching model (MLP Matching) [42]; the Siamese network [43]; the SimCLR contrastive learning framework [33]; DisAlign [44], a distribution-alignment framework; and TMFNet [45], a multimodal Transformer approach. These methods collectively spanned traditional statistical paradigms and modern deep representation learning techniques, allowing a comprehensive evaluation of the proposed model’s effectiveness. All baseline models were trained and evaluated on the same dataset splits and under consistent evaluation criteria, while each was configured with its own optimal hyperparameters as recommended in the original literature or determined through cross-validation to ensure a fair and reproducible comparison.

4.3. Overall Performance Comparison Between the Proposed Model and Baseline Methods

This experiment was designed to validate the overall performance and robustness of the proposed model in cross-hierarchical ecological semantic alignment by systematically comparing it with multiple traditional statistical and deep learning baselines. The core objective was to assess the superiority of the model in terms of semantic alignment accuracy, feature consistency, and community structure preservation. To this end, eight representative baseline methods covering three paradigms—linear ordination, spatial matching, and deep contrastive learning—were selected and evaluated using four metrics: top-1 and top-5 matching accuracy, mean cosine similarity, and Jaccard index.

As shown in Table 2 and Figure 5, the traditional statistical models RDA, CCA, and Procrustes exhibited relatively low performance across all metrics. RDA and CCA partially captured ecological gradients in low-dimensional linear spaces; however, their reliance on linear assumptions and limited expressive capacity restricted their ability to characterize complex aboveground–belowground interactions. The Procrustes method achieved slight improvement through geometric alignment but remained constrained by rigid transformations, leading to suboptimal performance in high-dimensional semantic alignment. In contrast, deep learning-based methods demonstrated clear performance gains. The MLP Matching and Siamese Network models introduced nonlinear mapping through deep representation learning, improving cross-modal feature similarity. SimCLR further enhanced discriminability by adopting an unsupervised contrastive learning framework with data augmentation and information entropy maximization, achieving stronger alignment at the instance level. Building on this, TMFNet [45] leveraged a multimodal Transformer backbone with cross-modal attention to jointly model heterogeneous ecological signals, enabling better contextual fusion of vegetation and soil attributes. DisAlign [44] further improved distributional consistency through adaptive contrastive regularization, effectively aligning heterogeneous feature distributions and mitigating modality imbalance. These advanced baselines achieved substantial improvements over SimCLR, demonstrating the benefits of multimodal fusion and distribution-alignment mechanisms. The proposed model, however, outperformed all baselines across every evaluation metric. By integrating a multi-channel Transformer encoder, distribution-aligned regularization, and disturbance-aware attention, it achieved superior nonlinear mapping and distribution robustness. 
During optimization, contrastive learning maximized cross-modal semantic consistency, while distribution-level constraints ensured global stability of feature alignment, thereby mitigating ecological semantic shifts. The model’s advantage over both TMFNet and DisAlign stems from its hierarchical design: the Transformer architecture captures higher-order dependencies between species composition and functional traits; the distribution-aligned loss enforces statistical consistency across modalities; and the disturbance-aware attention dynamically adjusts to environmental variability. Consequently, the proposed model advances from point-wise similarity learning to distribution-consistent ecological semantic modeling, achieving leading performance in top-1 accuracy (78.6%) and mean cosine similarity (0.784), confirming its effectiveness, robustness, and generalizability for cross-hierarchical ecosystem representation learning.

4.4. Comprehensive Evaluation Across Alignment, Distribution, and Structural Recovery Metrics

This experiment aimed to comprehensively evaluate the overall performance of the proposed model from multiple perspectives, including sample-level alignment accuracy, distributional consistency, and community structural recovery. Unlike the previous comparison, this analysis emphasized not only the individual sample alignment between aboveground and belowground communities but also the global consistency of feature distributions and ecological structures. The selected evaluation metrics encompassed statistical distribution distances (KL divergence and EMD), community similarity indicators (Sørensen coefficient and NMDS stress), and alignment accuracy metrics (top-1 accuracy and mean cosine similarity), jointly reflecting model performance in probabilistic, semantic, and ecological dimensions.

As shown in Table 3, linear models such as RDA and CCA exhibited the highest KL divergence and EMD values, indicating substantial embedding distributional shifts and limited capacity to capture nonlinear ecological dependencies. Procrustes alignment achieved moderate geometric improvement but remained constrained by rigid spatial transformations, which restricted its adaptability to complex multimodal feature spaces. Deep learning-based models demonstrated progressive enhancements. MLP Matching utilized nonlinear transformations for feature learning, improving sample-level similarity, while the Siamese Network introduced metric learning to enhance alignment precision. SimCLR further reduced distributional bias through unsupervised contrastive optimization with data augmentation, yielding stronger representation consistency across modalities. Building upon these, TMFNet [45] incorporated a multimodal Transformer backbone with cross-modal attention to jointly model aboveground and belowground feature dependencies, effectively improving structural alignment. DisAlign [44] achieved even greater gains in KL divergence and EMD reduction by explicitly optimizing adaptive distribution alignment via contrastive regularization, highlighting the importance of statistical consistency in multimodal ecological embedding. The proposed model, however, consistently outperformed all baselines across every metric, achieving the lowest KL divergence (0.128), smallest EMD (0.107), highest Sørensen coefficient (0.713), and minimal NMDS stress (0.094). This confirms its superiority in both sample-level correspondence and global-scale distributional coherence. Theoretical analysis attributes these improvements to the integration of a multi-channel Transformer encoder, distribution-aligned loss, and disturbance-aware attention. 
Unlike RDA and CCA, which optimize only linear correlations, or Procrustes, which minimizes geometric distortion without statistical adaptivity, the proposed framework jointly optimizes multi-head attention and probabilistic distribution alignment, promoting convergence of embedding distributions under expectation constraints—effectively minimizing both feature divergence and statistical bias. The disturbance-aware attention mechanism further introduces adaptive weighting that stabilizes representation learning under heterogeneous ecological conditions. Through the unified alignment of semantic, structural, and statistical dimensions, the proposed model achieves enhanced robustness, ecological interpretability, and consistent performance across complex environmental gradients.

4.5. Ablation Study Evaluating the Contribution of Each Module in the Proposed Framework

This experiment aimed to verify the individual contributions and synergistic effects of the core modules in the proposed framework through systematic ablation analysis, thereby revealing the internal mechanisms underlying performance improvement. Three components were independently removed—Ecological Semantic Encoder, Disturbance-Aware Attention Module, and Distribution-Aligned Contrastive Loss—to evaluate changes in semantic alignment accuracy, distributional consistency, and ecological structure reconstruction.

As shown in Table 4, removing the Ecological Semantic Encoder led to a significant drop in top-1 accuracy and mean cosine similarity, along with an increase in KL divergence, indicating its critical role in constructing high-dimensional semantic representations of aboveground and belowground features. Excluding the Disturbance-Aware Attention Module slightly reduced performance, revealing its importance in dynamically perceiving environmental disturbances and maintaining alignment stability. Eliminating the Distribution-Aligned Loss preserved local matching precision but weakened global distributional consistency, emphasizing its central function in balancing local and global optimization. The full model achieved the best performance across all metrics, demonstrating the complementary synergy among the three components, which jointly ensure robust cross-hierarchical semantic alignment and ecological structure recovery.

4.6. Disturbance Sensitivity

This experiment aimed to quantitatively evaluate the sensitivity and interpretability of the proposed disturbance-aware attention mechanism under different ecological disturbance regimes. Specifically, three dominant disturbance variables were examined: grazing intensity, fire frequency, and land-use change. These variables respectively represent biological, abiotic, and anthropogenic disturbance gradients that frequently reshape both aboveground vegetation and soil seed bank compositions. By systematically varying these factors, the experiment sought to determine (1) how the model’s attention response varies with disturbance type and intensity, and (2) which disturbances exert the strongest influence on ecological semantic alignment. Each disturbance type was partitioned into three levels (low, medium, and high) based on field survey metadata. For each level, we measured the mean attention activation, distributional stability index (DSI), and embedding divergence (ΔMMD) before and after disturbance modulation, reflecting how effectively the model adapts to external perturbations.

As shown in Table 5, the disturbance-aware attention exhibited distinct response patterns across disturbance types. The highest sensitivity was observed under land-use change, where attention activation increased by 26.4% and ΔMMD decreased by 0.032 on average, indicating strong adaptation to large-scale anthropogenic transformations such as cropland–grassland or forest–pasture transitions. These changes drastically reshape both community composition and soil propagule banks, leading to high ecological asymmetry that the model effectively compensates for through attention modulation. Fire disturbance showed intermediate effects, reflecting the short-term but spatially heterogeneous influence of burning on species regeneration and seed viability. Grazing, while inducing gradual structural heterogeneity, yielded the lowest sensitivity, consistent with its localized and cumulative ecological impact. Overall, the results confirm that the disturbance-aware attention mechanism captures cross-scale ecological perturbations, dynamically reweights alignment strength under varying disturbance regimes, and enhances both the robustness and interpretability of ecological semantic modeling.

4.7. Discussion

The experimental results of this study demonstrate that the proposed framework integrating ecological semantic encoding and distribution-aligned loss not only achieves remarkable quantitative improvements (Top-1 accuracy 78.6%, cosine similarity 0.784, KL divergence 0.128, etc.) but also reveals the underlying ecological coupling between aboveground vegetation and underground seed banks. The high accuracy and similarity indicate that the model can identify functionally equivalent community units, suggesting that it captures ecological functional redundancy and potential regeneration capacity—meaning that even when surface vegetation changes, the soil seed bank retains the potential to restore the original community structure, thus revealing the system’s intrinsic ecological resilience. The reduced KL divergence reflects the model’s ability to align ecological semantics at the distributional level, corresponding ecologically to system-level functional stability and buffering capacity: the convergence of functional compositions implies that despite external disturbances, key functional groups remain balanced, maintaining ecosystem homeostasis. The disturbance-aware attention mechanism enables the model to sustain strong semantic alignment under varying intensities of land-use change, fire, and grazing disturbances, an outcome that ecologically corresponds to adaptive ecosystem responses, where functional groups reorganize and compensate through structural and trait-based adjustments to sustain ecosystem functioning and succession. Meanwhile, the Jaccard index (0.512) and Sørensen coefficient (0.713) indicate a high capacity for capturing species and functional diversity patterns, implying that the model not only predicts community composition but also reflects the co-recovery of functional and phylogenetic diversity. 
Collectively, these quantitative indicators carry ecological significance: they suggest that the model unveils the hidden restorative potential and diversity-maintaining mechanisms within forest ecosystems. This provides a novel, intelligent framework for evaluating ecosystem restoration potential, quantifying disturbance impacts on ecological functions, and guiding the prioritization of restoration zones—marking a shift from data-driven prediction toward mechanistic ecological understanding.

Beyond these ecological insights, the model also exhibits stable optimization behavior under moderate hyperparameter variations. Theoretical analysis suggests that key hyperparameters, including the learning rate, contrastive temperature parameter ($\tau$), and distribution alignment coefficient ($\lambda$), jointly regulate the smoothness of gradient updates and the balance between discriminability and generalization in the embedding space. Within reasonable ranges, their variations do not alter the convergence trajectory, demonstrating low sensitivity to initialization and training dynamics. Furthermore, convergence stability is not solely dependent on early stopping; rather, it arises from the regularization effect of the distribution-aligned loss, which mitigates gradient oscillations and stabilizes the optimization landscape. This property ensures that the model consistently converges to a near-optimal solution even under different training conditions, underscoring its theoretical robustness and practical reliability in ecological modeling tasks.
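The role of the temperature in balancing discriminability and smoothness can be illustrated with a small sketch. It assumes the temperature divides the similarity scores before a softmax, as in the InfoNCE term of Appendix A.2; the similarity values and the two temperature settings are hypothetical and are not the settings used in the experiments.

```python
import math

def softmax_with_temperature(sims, tau):
    """Softmax over similarity scores scaled by 1 / tau (numerically stable)."""
    scaled = [s / tau for s in sims]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical cosine similarities; index 0 is the matched vegetation/seed-bank pair.
sims = [0.9, 0.7, 0.2]
sharp = softmax_with_temperature(sims, tau=0.05)   # small tau: peaked distribution
smooth = softmax_with_temperature(sims, tau=1.0)   # large tau: flatter distribution
print([round(x, 3) for x in sharp])
print([round(x, 3) for x in smooth])
```

A small temperature concentrates nearly all probability mass on the best-matching pair, which sharpens discrimination but produces steeper gradients; a large temperature spreads the mass and yields smoother updates.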

4.8. Limitations and Future Work

Although the proposed approach achieved satisfactory results in cross-hierarchical semantic alignment between aboveground vegetation and belowground seed banks, several limitations remain. First, the dataset is primarily derived from a limited number of regional sampling sites. Despite the model’s strong generalization across different ecological types, its scalability to larger spatial extents and more complex ecological networks remains to be further explored. Moreover, the current framework has not yet been systematically validated for transferability to other ecosystem types, such as wetlands, mangroves, or alpine tundra, where biotic and abiotic interactions may exhibit distinct structural patterns and temporal dynamics. Additionally, the temporal span and controlled conditions of seed germination experiments were relatively limited, potentially leading to underrepresentation of dormant or rare species and constraining the model’s ability to capture long-term successional trajectories. From a modeling perspective, while the disturbance-aware attention mechanism effectively captures external environmental variations, its robustness may still be challenged under extreme or compounding disturbances—such as prolonged desertification, salinization, or continuous cultivation—where nonlinear interactions among disturbance variables can induce unstable attention distributions.

Future research can address these challenges by extending the framework to multi-temporal and multi-scale ecological monitoring, integrating remote sensing imagery, hyperspectral data, and satellite-based ecosystem observations into the multimodal fusion process. Such integration would enhance the model’s sensitivity to spatial heterogeneity and temporal variability, facilitating large-scale ecological assessment and early-warning monitoring. Furthermore, the distribution-alignment strategy can be refined by incorporating generative modeling and self-supervised transfer learning to enable extrapolative reasoning across unobserved environmental domains. Combining process-based ecological models with artificial intelligence architectures also represents a promising direction for improving interpretability, scalability, and predictive reliability, ultimately advancing intelligent ecological modeling toward broader ecosystem applications and sustainable management.

5. Conclusions

This study addressed the ecological semantic alignment problem between aboveground vegetation and belowground seed banks by proposing a cross-hierarchical ecological modeling framework that integrates an ecological semantic encoder, a distribution-aligned contrastive loss, and a disturbance-aware attention mechanism. Focusing on typical grassland ecosystems, the framework utilized deep representation and multimodal fusion of multi-source ecological data to reveal the functional and structural coupling mechanisms between aboveground and belowground communities, thereby providing quantitative support for ecological restoration and ecosystem service optimization. Experimental results demonstrated that the proposed model outperformed both traditional statistical approaches and mainstream deep learning models across multiple evaluation metrics. Specifically, the model reached a top-1 matching accuracy of 78.6%, a mean cosine similarity of 0.784, and a Jaccard index of 0.512, while reducing the overall KL divergence to 0.128, indicating superior alignment stability and distributional consistency. The model also performed strongly on community structure recovery metrics (Sørensen coefficient 0.713, NMDS stress 0.094), enhancing the ecological interpretability of the semantic embeddings.

The primary innovation of this research lies in the introduction of contrastive learning into cross-hierarchical ecological alignment for the first time, establishing a dual-level optimization mechanism that bridges individual sample representation and global distribution alignment. The ecological semantic encoder enabled deep semantic fusion between aboveground and belowground features, while the distribution-aligned loss function effectively mitigated cross-modal distribution shifts. Additionally, the disturbance-aware attention module incorporated dynamic modeling of environmental and anthropogenic disturbances, ensuring robust performance under complex ecological perturbations. This study not only expands the methodological boundaries of intelligent ecological data modeling but also provides a practical and scalable technological pathway for degraded grassland restoration, land-use planning, and ecosystem service trade-off analysis. Furthermore, it offers novel theoretical and methodological insights for future interdisciplinary research at the intersection of ecology and artificial intelligence.

Author Contributions

Conceptualization, J.P., Z.F., H.Z., J.L. and M.D.; Data curation, R.S.; Formal analysis, Y.Z.; Funding acquisition, J.L. and M.D.; Investigation, Y.Z.; Methodology, J.P., Z.F. and H.Z.; Project administration, J.L. and M.D.; Resources, H.Z., Y.L. and R.S.; Software, J.P. and Z.F.; Supervision, M.D.; Validation, Y.L. and Y.Z.; Visualization, Y.L. and R.S.; Writing—original draft, J.P., Z.F., H.Z., Y.L., Y.Z., R.S., J.L. and M.D.; J.P., Z.F. and H.Z. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61202479.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Detailed Mathematical Derivations

This supplementary appendix provides the complete mathematical derivations that were previously condensed in the main text to enhance the transparency of the proposed framework.

Appendix A.1. Derivation of the Ecological Semantic Encoder

Let $X_{v} \in \mathbb{R}^{n \times d_{v}}$ and $X_{s} \in \mathbb{R}^{n \times d_{s}}$ denote the feature matrices of aboveground vegetation and soil seed bank modalities, respectively. Each modality is first embedded through a multi-head attention encoder:

$$H_{m} = \mathrm{softmax}\left(\frac{Q_{m} K_{m}^{\top}}{\sqrt{d_{k}}}\right) V_{m}, \quad m \in \{v, s\}$$

where $Q_{m} = X_{m} W_{Q}$, $K_{m} = X_{m} W_{K}$, and $V_{m} = X_{m} W_{V}$. The ecological semantic embedding $E_{m}$ is then obtained via layer normalization and feed-forward refinement:

$$E_{m} = \mathrm{LayerNorm}\left(\mathrm{FFN}(H_{m}) + H_{m}\right)$$

which captures species-trait–environment associations in a shared representation space.
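The two encoder equations can be traced step by step in the following minimal sketch. It is an illustration under stated assumptions rather than the implementation used in the experiments: it uses a single attention head, a two-layer ReLU network as the FFN, layer normalization without learned scale and shift, and random placeholder weights. All sizes and helper names (`matmul`, `encode`, `rand_matrix`) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

def layer_norm(m, eps=1e-5):
    """Normalize each row to zero mean and (approximately) unit variance."""
    out = []
    for row in m:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out

def rand_matrix(rows, cols, rng):
    """Placeholder weights or features drawn from a standard normal."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def encode(x, w_q, w_k, w_v, w_1, w_2):
    """E = LayerNorm(FFN(H) + H) with H = softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    h = matmul(softmax_rows(scores), v)
    # Assumed FFN: linear -> ReLU -> linear, mapping back to d_k dimensions.
    hidden = [[max(0.0, z) for z in row] for row in matmul(h, w_1)]
    ffn = matmul(hidden, w_2)
    residual = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(ffn, h)]
    return layer_norm(residual)

rng = random.Random(0)
n, d_v, d_k, d_ff = 4, 6, 8, 16          # hypothetical sizes
x_v = rand_matrix(n, d_v, rng)           # stand-in for vegetation features X_v
e_v = encode(x_v,
             rand_matrix(d_v, d_k, rng), rand_matrix(d_v, d_k, rng),
             rand_matrix(d_v, d_k, rng), rand_matrix(d_k, d_ff, rng),
             rand_matrix(d_ff, d_k, rng))
print(len(e_v), len(e_v[0]))             # n rows, d_k columns
```

The same `encode` function would be applied to the seed bank features $X_{s}$ with its own projection matrices, so that both modalities end up in an embedding space of equal dimension.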

Appendix A.2. Distribution-Aligned Loss Derivation

To align the distributions between modalities, we define a composite objective combining contrastive alignment and distributional regularization:

$$L_{\mathrm{total}} = L_{\mathrm{InfoNCE}} + \lambda L_{\mathrm{MMD}}$$

The InfoNCE term promotes sample-level semantic alignment:

$$L_{\mathrm{InfoNCE}} = - \sum_{i = 1}^{N} \log \frac{\exp\left(\mathrm{sim}(E_{v}^{i}, E_{s}^{i}) / \tau\right)}{\sum_{j = 1}^{N} \exp\left(\mathrm{sim}(E_{v}^{i}, E_{s}^{j}) / \tau\right)}$$

and the MMD term enforces distribution-level consistency:

$$L_{\mathrm{MMD}} = \left\| \frac{1}{n} \sum_{i} \phi(E_{v}^{i}) - \frac{1}{n} \sum_{i} \phi(E_{s}^{i}) \right\|_{\mathcal{H}}^{2}$$

where $\phi(\cdot)$ maps embeddings into a reproducing kernel Hilbert space $\mathcal{H}$, and $\lambda$ controls the balance between local and global alignment.
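The composite objective can be sketched numerically as follows. The sketch assumes that the similarity function is cosine similarity and that the MMD term is evaluated with a Gaussian kernel through the usual kernel expansion of the squared RKHS distance (mean kernel value within each modality minus twice the mean cross-modality value). The default values of `tau`, `lam`, and `sigma` are placeholders, not the settings used in the experiments.

```python
import math

def cosine(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(e_v, e_s, tau):
    """Sum over i of the negative log-softmax score of the matched pair (i, i)."""
    loss = 0.0
    for i, v in enumerate(e_v):
        logits = [cosine(v, s) / tau for s in e_s]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss -= logits[i] - log_z
    return loss

def gaussian_kernel(u, v, sigma):
    """k(u, v) = exp(-||u - v||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * sigma ** 2))

def mmd2(e_v, e_s, sigma):
    """Biased estimate of the squared MMD via the kernel expansion."""
    def mean_k(xs, ys):
        return sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys) / (len(xs) * len(ys))
    return mean_k(e_v, e_v) - 2.0 * mean_k(e_v, e_s) + mean_k(e_s, e_s)

def total_loss(e_v, e_s, tau=0.1, lam=0.5, sigma=1.0):
    """L_total = L_InfoNCE + lambda * L_MMD."""
    return info_nce(e_v, e_s, tau) + lam * mmd2(e_v, e_s, sigma)

# Toy embeddings: matched pairs are identical, so both terms are near zero.
e_v = [[1.0, 0.0], [0.0, 1.0]]
e_s_aligned = [[1.0, 0.0], [0.0, 1.0]]
e_s_swapped = [[0.0, 1.0], [1.0, 0.0]]   # same distribution, wrong pairing
e_s_shifted = [[2.0, 2.0], [3.0, 3.0]]   # different distribution
print(total_loss(e_v, e_s_aligned), total_loss(e_v, e_s_swapped))
```

In the swapped case the MMD term stays at zero while the InfoNCE term grows, and in the shifted case the MMD term becomes positive; this illustrates why the two terms are described as sample-level and distribution-level alignment, respectively.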

Appendix A.3. Gradient Derivation for Optimization

The gradient of $L_{\mathrm{total}}$ with respect to $E_{v}$ is:

$$\frac{\partial L_{\mathrm{total}}}{\partial E_{v}} = \frac{\partial L_{\mathrm{InfoNCE}}}{\partial E_{v}} + \lambda \frac{\partial L_{\mathrm{MMD}}}{\partial E_{v}}$$

Explicitly, for the MMD component:

$$\frac{\partial L_{\mathrm{MMD}}}{\partial E_{v}^{i}} = \frac{2}{n} \sum_{j} \left[ k(E_{v}^{i}, E_{v}^{j}) - k(E_{v}^{i}, E_{s}^{j}) \right] \frac{\partial k(E_{v}^{i}, \cdot)}{\partial E_{v}^{i}}$$

where $k(\cdot, \cdot)$ is a Gaussian kernel. The detailed optimization steps, including learning rate scheduling and regularization strategies, are included for reproducibility.
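As a sanity check on the MMD gradient, the sketch below differentiates a biased, kernel-expanded estimate of the squared MMD with a Gaussian kernel and compares the analytic result with central finite differences. The estimator normalization (division by $n^{2}$), the bandwidth `sigma`, and the toy embeddings are assumptions made for illustration; constant prefactors depend on how the estimator is normalized.

```python
import math

def k(x, y, sigma):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma ** 2))

def dk_dx(x, y, sigma):
    """Gradient of the Gaussian kernel with respect to its first argument."""
    kv = k(x, y, sigma)
    return [-(a - b) / sigma ** 2 * kv for a, b in zip(x, y)]

def mmd2(vs, ss, sigma):
    """Biased squared-MMD estimate; assumes both sets contain n samples."""
    n = len(vs)
    kvv = sum(k(a, b, sigma) for a in vs for b in vs)
    kvs = sum(k(a, b, sigma) for a in vs for b in ss)
    kss = sum(k(a, b, sigma) for a in ss for b in ss)
    return (kvv - 2.0 * kvs + kss) / n ** 2

def mmd2_grad(vs, ss, i, sigma):
    """Analytic gradient of mmd2 with respect to the i-th vegetation embedding."""
    n = len(vs)
    grad = [0.0] * len(vs[i])
    for j in range(n):
        g_vv = dk_dx(vs[i], vs[j], sigma)   # within-modality term
        g_vs = dk_dx(vs[i], ss[j], sigma)   # cross-modality term
        grad = [g + 2.0 / n ** 2 * (a - b) for g, a, b in zip(grad, g_vv, g_vs)]
    return grad

def numeric_grad(vs, ss, i, sigma, h=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = []
    for d in range(len(vs[i])):
        plus = [row[:] for row in vs]
        minus = [row[:] for row in vs]
        plus[i][d] += h
        minus[i][d] -= h
        grad.append((mmd2(plus, ss, sigma) - mmd2(minus, ss, sigma)) / (2 * h))
    return grad

# Hypothetical two-dimensional embeddings for three paired plots.
e_v = [[0.0, 0.0], [1.0, 0.5], [-0.5, 1.0]]
e_s = [[0.2, 0.1], [0.8, 0.9], [-0.3, 0.4]]
analytic = mmd2_grad(e_v, e_s, 0, sigma=1.0)
numeric = numeric_grad(e_v, e_s, 0, sigma=1.0)
print(analytic, numeric)
```

Agreement between the two gradients gives a quick check that a hand-derived kernel gradient is consistent with the loss being optimized before it is used in training.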

References

1. Sokol, N.W.; Slessarev, E.; Marschmann, G.L.; Nicolas, A.; Blazewicz, S.J.; Brodie, E.L.; Firestone, M.K.; Foley, M.M.; Hestrin, R.; Hungate, B.A.; et al. Life and death in the soil microbiome: How ecological processes influence biogeochemistry. Nat. Rev. Microbiol. 2022, 20, 415–430.
2. Birhanu, L.; Bekele, T.; Tesfaw, B.; Demissew, S. Soil seed bank composition and aboveground vegetation in dry Afromontane forest patches of Northwestern Ethiopia. Trees For. People 2022, 9, 100292.
3. Al-Huqail, A.A.; Al-Harbi, H.F.; Alowaifeer, A.M.; El-Sheikh, M.A.; Assaeed, A.M.; Alsaleem, T.S.; Kassem, H.S.; Azab, O.M.; Dar, B.A.; Malik, J.A.; et al. Correlation between aboveground vegetation composition and soil seed bank of Raudhat desert habitat: A case study of Raudhat Alkhafs, Saudi Arabia. BMC Plant Biol. 2025, 25, 136.
4. Zhao, Y.; Li, M.; Deng, J.; Wang, B. Afforestation affects soil seed banks by altering soil properties and understory plants on the eastern Loess Plateau, China. Ecol. Indic. 2021, 126, 107670.
5. Chen, M.; Hussain, S.; Liu, Y.; Mustafa, G.; Hu, B.; Qin, Z.; Wang, X. Responses of soil seed bank and its above-ground vegetation to various reclamation patterns. Mar. Environ. Res. 2024, 196, 106436.
6. Ma, H.; Mo, L.; Crowther, T.W.; Maynard, D.S.; van den Hoogen, J.; Stocker, B.D.; Terrer, C.; Zohner, C.M. The global distribution and environmental drivers of aboveground versus belowground plant biomass. Nat. Ecol. Evol. 2021, 5, 1110–1122.
7. Larson, J.E.; Suding, K.N. Seed bank bias: Differential tracking of functional traits in the seed bank and vegetation across a gradient. Ecology 2022, 103, e3651.
8. Zhang, Y.; Wa, S.; Liu, Y.; Zhou, X.; Sun, P.; Ma, Q. High-accuracy detection of maize leaf diseases CNN based on multi-pathway activation function module. Remote Sens. 2021, 13, 4218.
9. Huanca-Nunez, N.; Chazdon, R.L.; Russo, S.E. Trait-mediated variation in seedling performance in Costa Rican successional forests: Comparing above-ground, below-ground, and allocation-based traits. Plants 2024, 13, 2378.
10. Lv, Y.; Shen, M.; Meng, B.; Zhang, H.; Sun, Y.; Zhang, J.; Chang, L.; Li, J.; Yi, S. The similarity between species composition of vegetation and soil seed bank of grasslands in Inner Mongolia, China: Implications for the asymmetric response to precipitation. Plants 2021, 10, 1890.
11. Lin, X.; Wa, S.; Zhang, Y.; Ma, Q. A dilated segmentation network with the morphological correction method in farming area image Series. Remote Sens. 2022, 14, 1771.
12. Haobo, W. Comparative learning leads weak label learning new SOTA, and Zhejiang University’s new research was selected as ICLR Oral. Heart 2022, 13, 58.
13. Tang, Y.; Li, H. Comparing the performance of machine learning methods in predicting soil seed bank persistence. Ecol. Inform. 2023, 77, 102188.
14. Rosbakh, S.; Pichler, M.; Poschlod, P. Machine-learning algorithms predict soil seed bank persistence from easily available traits. Appl. Veg. Sci. 2022, 25, e12660.
15. Khan, R.W.A.; Shaheen, H.; Islam Dar, M.E.U.; Habib, T.; Manzoor, M.; Gillani, S.W.; Al-Andal, A.; Ayoola, J.O.; Waheed, M. A data-driven approach to forest health assessment through multivariate analysis and machine learning techniques. BMC Plant Biol. 2025, 25, 915.
16. Luo, S.; Ni, J.; Chen, S.; Yu, R.; Xie, Y.; Liu, L.; Jin, Z.; Yao, H.; Jia, X. Free: The foundational semantic recognition for modeling environmental ecosystems. arXiv 2023, arXiv:2311.10255.
17. Plohák, P.; Švehláková, H.; Stalmachová, B.; Goňo, M.; Dvorský, T. Combining extraction and cultivation methods for soil seed bank analysis increases number of captured species and their similarity to above-ground vegetation. Front. Plant Sci. 2025, 15, 1500941.
18. Wang, Z.; Yang, D.H.; Yi, T.H.; Zhang, G.H.; Han, J.G. Eliminating environmental and operational effects on structural modal frequency: A comprehensive review. Struct. Control Health Monit. 2022, 29, e3073.
19. Wang, L.; Jackson, D.A. Effects of sample size, data quality, and species response in environmental space on modeling species distributions. Landsc. Ecol. 2023, 38, 4009–4031.
20. Dou, X.; Li, W.; He, Y.; Zhao, Y.; Sang, Q.; Wang, Y.; Wang, C.; Yan, Y. Assessing the Future Effectiveness of Ecological Protection and Restoration by Compiling Ecological Patterns & Services Indicators and Multi-Scenario Simulation. Environ. Sustain. Indic. 2025, 28, 100939.
21. Borowiec, M.L.; Dikow, R.B.; Frandsen, P.B.; McKeeken, A.; Valentini, G.; White, A.E. Deep learning as a tool for ecology and evolution. Methods Ecol. Evol. 2022, 13, 1640–1660.
22. Pichler, M.; Hartig, F. Machine learning and deep learning—A review for ecologists. Methods Ecol. Evol. 2023, 14, 994–1016.
23. Wu, Y.; Gadsden, S.A. Machine learning algorithms in microbial classification: A comparative analysis. Front. Artif. Intell. 2023, 6, 1200994.
24. Luo, C.; Guo, X.; Feng, C.; Xiao, C. Soil seed bank responses to anthropogenic disturbances and its vegetation restoration potential in the arid mining area. Ecol. Indic. 2023, 154, 110549.
25. Sanou, L.; Savadogo, P.; Zida, D.; Thiombiano, A. Variation in soil seed bank and relationship with aboveground vegetation across microhabitats in a savanna-woodland of West Africa. Nord. J. Bot. 2022, 2022, e03304.
26. Luo, C.; Guo, X.P.; Feng, C.D.; Ye, J.P.; Li, P.F.; Li, Z.T. Spatial patterns of soil seed banks and their relationships with above-ground vegetation in an arid desert. Appl. Veg. Sci. 2021, 24, e12616.
27. Durkee, M.S.; Lleras, K.; Drukker, K.; Ai, J.; Cao, T.; Casella, G.; Ghosh, D.; Clark, M.R.; Giger, M.L. Generalizations of the Jaccard index and Sørensen index for assessing agreement across multiple readers in object detection and instance segmentation in biomedical imaging. J. Med. Imaging 2023, 10, 065503.
28. DeMalach, N.; Kigel, J.; Sternberg, M. The soil seed bank can buffer long-term compositional changes in annual plant communities. J. Ecol. 2021, 109, 1275–1283.
29. Borokini, I.T.; Weisberg, P.J.; Peacock, M.M. Quantifying the relationship between soil seed bank and plant community assemblage in sites harboring the threatened Ivesia webberi in the western Great Basin Desert. Appl. Veg. Sci. 2021, 24, e12547.
30. Ray, J.; Bordolui, S.K. Role of seed banks in the conservation of plant diversity and ecological restoration. J. Environ. Sci. 2021, 3, 1–16.
31. Gao, J.; Yang, B.; Babu, S. Ecological links between aboveground and underground ecosystems under global change. Front. Ecol. Evol. 2024, 12, 1347653.
32. Zhang, W.; Stratos, K. Understanding hard negatives in noise contrastive estimation. arXiv 2021, arXiv:2104.06245.
33. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Virtual, 13–18 July 2020; pp. 1597–1607.
34. Xiong, J.; Yu, H.; Li, L.; Yuan, M.; Yu, J. Asymmetry between ecosystem health and ecological quality from an Earth observation perspective. Sci. Rep. 2025, 15, 10143.
35. Zhou, X.; Chen, S.; Ren, Y.; Zhang, Y.; Fu, J.; Fan, D.; Lin, J.; Wang, Q. Atrous Pyramid GAN Segmentation Network for Fish Images with High Performance. Electronics 2022, 11, 911.
36. Mir, Y.H.; Mir, S.; Ganie, M.A.; Bhat, J.A.; Shah, A.M.; Mushtaq, M.; Irshad, I. Overview of land use and land cover change and its impacts on natural resources. In Ecologically Mediated Development: Promoting Biodiversity Conservation and Food Security; Springer: Singapore, 2025; pp. 101–130.
37. Alemayehu, B.; Suarez-Minguez, J.; Rosette, J. The Implications of Plantation Forest-Driven Land Use/Land Cover Changes for Ecosystem Service Values in the Northwestern Highlands of Ethiopia. Remote Sens. 2024, 16, 4159.
38. Rana, S.; Gatti, M. Comparative Evaluation of Modified Wasserstein GAN-GP and State-of-the-Art GAN Models for Synthesizing Agricultural Weed Images in RGB and Infrared Domain. MethodsX 2025, 14, 103309.
39. Van Den Wollenberg, A.L. Redundancy analysis an alternative for canonical correlation analysis. Psychometrika 1977, 42, 207–219.
40. Ter Braak, C.J. Canonical correspondence analysis: A new eigenvector technique for multivariate direct gradient analysis. Ecology 1986, 67, 1167–1179.
41. Schönemann, P.H. A generalized solution of the orthogonal procrustes problem. Psychometrika 1966, 31, 1–10.
42. Huang, P.S.; He, X.; Gao, J.; Deng, L.; Acero, A.; Heck, L. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, San Francisco, CA, USA, 27 October–1 November 2013; pp. 2333–2338.
43. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a “siamese” time delay neural network. Adv. Neural Inf. Process. Syst. 1993, 6, 737–744.
44. Zhang, S.; Li, Z.; Yan, S.; He, X.; Sun, J. Distribution alignment: A unified framework for long-tail visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2361–2370.
45. Liu, Y.; Gao, K.; Wang, H.; Yang, Z.; Wang, P.; Ji, S.; Huang, Y.; Zhu, Z.; Zhao, X. A Transformer-based multi-modal fusion network for semantic segmentation of high-resolution remote sensing imagery. Int. J. Appl. Earth Obs. Geoinf. 2024, 133, 104083.

Figure 1. The overall framework diagram illustrates the proposed methodological pipeline.

Figure 2. The figure illustrates the overall architecture of the Ecological Semantic Encoder.

Figure 3. The figure illustrates the structural principle of the Distribution-Aligned Contrastive Loss.

Figure 4. The figure illustrates the overall architecture of the disturbance-aware attention module.

Figure 5. Overall performance comparison.

Table 1. Composition of the dataset.

Data Type	Quantity	Period	Source and Method
Aboveground vegetation records	12,500	2022–2023	Quadrat survey (1 m × 1 m) for species identification and abundance estimation
Soil seed bank samples	2500	2022–2023	0–10 cm soil cores, laboratory germination test (90 days)
Environmental variables	2500	2022–2023	GPS/DEM, soil nutrients (N, P, K, OM), meteorological data
Disturbance history	2500	2013–2023	Field interviews, remote sensing (Landsat, Sentinel-2)
Total ecological pairs	$N = 2500$	2022–2023	Integrated aboveground–belowground paired dataset

Table 2. Overall performance comparison between the proposed model and baseline methods.

Method	Top-1 Acc (%)	Top-5 Acc (%)	Mean Cosine Sim.	Jaccard Index
RDA [39]	58.4	71.2	0.621	0.413
CCA [40]	60.7	73.5	0.648	0.425
Procrustes [41]	63.2	75.8	0.667	0.437
MLP Matching [42]	68.5	81.9	0.703	0.465
Siamese Network [43]	70.1	83.4	0.716	0.471
SimCLR [33]	71.8	85.0	0.733	0.479
DisAlign [44]	75.9	87.8	0.763	0.501
TMFNet [45]	74.6	87.1	0.752	0.493
Proposed Model	78.6	89.3	0.784	0.512

Table 3. Comprehensive evaluation across alignment, distribution, and structural recovery metrics.

Method	KL Div. ↓	EMD ↓	Sørensen Coeff.	NMDS Stress ↓	Top-1 Acc (%)	Mean Cos. Sim.
RDA	0.241	0.184	0.612	0.147	58.4	0.621
CCA	0.218	0.176	0.623	0.141	60.7	0.648
Procrustes	0.201	0.169	0.636	0.136	63.2	0.667
MLP Matching	0.176	0.158	0.652	0.127	68.5	0.703
Siamese Network	0.162	0.145	0.664	0.121	70.1	0.716
SimCLR	0.157	0.141	0.671	0.118	71.8	0.733
DisAlign	0.137	0.119	0.701	0.103	75.9	0.763
TMFNet	0.145	0.132	0.689	0.111	74.6	0.752
Proposed Model	0.128	0.107	0.713	0.094	78.6	0.784

Table 4. Ablation study evaluating the contribution of each module in the proposed framework. Statistical significance is reported relative to the full model (* p < 0.05, ** p < 0.01).

Model Variant	Top-1 Acc (%)	Mean Cosine Sim.	KL Div. ↓	Jaccard Index
Without Ecological Semantic Encoder	70.9 **	0.713 **	0.182 **	0.469 **
Without Disturbance-Aware Attention	72.4 **	0.731 **	0.168 **	0.478 *
Without Distribution-Aligned Loss	74.3 *	0.751 *	0.152 *	0.491 *
Full Model (Proposed)	78.6	0.784	0.128	0.512

Table 5. Sensitivity analysis of disturbance-aware attention under different disturbance types and intensities.

Disturbance Type	Level	Mean Attention Activation	$Δ$ MMD (↓)	DSI (↑)
Grazing Intensity	Low	0.421	0.017	0.864
	Medium	0.457	0.014	0.879
	High	0.493	0.011	0.892
Fire Frequency	Low	0.476	0.021	0.874
	Medium	0.512	0.018	0.889
	High	0.549	0.016	0.901
Land-use Change	Low	0.541	0.039	0.881
	Medium	0.602	0.033	0.905
	High	0.665	0.032	0.918

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

