Abstract
Net ecosystem exchange (NEE) is a central metric for assessing carbon cycling, and its accurate quantification is critical for understanding terrestrial-atmosphere carbon exchange dynamics. In complex alpine regions, however, high-resolution NEE estimation remains challenging because observations are sparse and surface processes are heterogeneous. To address this challenge, we developed a multimodal feature fusion model (Multimodal-CNN-Attention-RF, MMCA-RF) that integrates convolutional neural networks (CNN) and random forest (RF) for NEE estimation in the Babao River Basin on the northeastern Tibetan Plateau. The model incorporates a cross-modal attention mechanism that dynamically weights feature interactions, thereby better capturing the spatially heterogeneous responses of vegetation to environmental drivers. MMCA-RF showed strong stability and generalization, with R² values of 0.89 (training) and 0.85 (testing). Based on model outputs, the Babao River Basin acted as a carbon sink during 2017–2023, with a mean annual NEE of −100.86 g C m⁻² yr⁻¹. Spatially, NEE showed pronounced heterogeneity, while seasonal variation followed a unimodal pattern. Among vegetation types, grasslands contributed the largest total carbon sink, whereas open woodlands showed the highest sequestration efficiency per unit area. Driver analysis identified temperature as the dominant control on NEE spatial variation, with interactions among temperature, precipitation, and topography further enhancing heterogeneity. This study provides a high-accuracy modeling approach for monitoring carbon cycling in alpine ecosystems and offers insights into the stability of regional carbon pools under climate change.
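
To illustrate the cross-modal attention idea, the following minimal sketch lets a feature vector from one modality weight the feature vectors of another. The abstract does not specify the form of attention used in MMCA-RF, so the scaled dot-product formulation here is an assumption. All names (`cross_modal_attention`, `veg_query`, `env_keys`, `env_values`) and the toy numbers are hypothetical. The CNN feature extraction and the RF regression stage are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query, keys, values):
    """Scaled dot-product attention across two modalities (assumed form).

    query  : feature vector from modality A (e.g. vegetation), length d
    keys   : feature vectors from modality B (e.g. environment), each length d
    values : vectors from modality B to be mixed, one per key
    Returns the attention-weighted mix of `values` and the weights themselves.
    """
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(d).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    # Weighted sum of the value vectors, dimension by dimension.
    fused = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return fused, weights


# Hypothetical toy inputs, not data from the study.
veg_query = [1.0, 0.0]
env_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
env_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

fused, weights = cross_modal_attention(veg_query, env_keys, env_values)
print("attention weights:", weights)
print("fused features:", fused)
```

In a pipeline of the kind described, fused vectors like `fused` would presumably serve as inputs to the random forest regressor. That wiring is likewise an assumption, not a detail given in the abstract.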