A Random Forest-Based CO2 Profile Emulator for Real-Time Prior Profile Generation in TanSat XCO2 Retrieval

Wu, Shaojie; Wang, Yang; Zhang, Likun; Jia, Heng; Zhang, Xianmei; Xu, Linglin; Dai, Yunxiao

doi:10.3390/rs17162764


by Shaojie Wu ^1,2, Yang Wang ^1,2,*, Likun Zhang ^1,2,3, Heng Jia ^1,2, Xianmei Zhang ^1,2, Linglin Xu ^1,2 and Yunxiao Dai ^1,2

¹ Institute of Geography, Fujian Normal University, Fuzhou 350007, China

² Key Laboratory for Humid Subtropical Ecogeographical Processes of the Ministry of Education, School of Geographical Sciences, Fujian Normal University, Fuzhou 350007, China

³ 519 Brigade of North China Geological Exploration Bureau, Baoding 071000, China

* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(16), 2764; https://doi.org/10.3390/rs17162764

Submission received: 30 June 2025 / Revised: 4 August 2025 / Accepted: 8 August 2025 / Published: 9 August 2025

(This article belongs to the Section Atmospheric Remote Sensing)


Abstract

Greenhouse gas monitoring satellites provide extensive observational data for the global remote sensing of atmospheric carbon dioxide (CO₂), yet a critical limitation in utilizing these data is the dependence of the full physics retrieval accuracy on a priori CO₂ profiles. This challenge is pronounced due to the significant time delay inherent in data assimilation products of high quality, whose latency prevents their use for retrieval in real time. The resulting temporal mismatch between the a priori constraint and the actual atmospheric state is a primary source of systematic bias in the retrieved CO₂. To address this issue, this paper develops a random forest-based CO₂ profile emulator (RF-CPE) with the core novelty of emulating the high-quality Carbon Tracker CO₂ profiles in real time. By learning the complex relationships between multisource features and the corresponding Carbon Tracker profiles, the emulator generates a dynamic profile specific to each observation. The application of this emulator-based approach to TanSat observations from 2017 to 2018 demonstrates significant performance gains, reducing the mean retrieval bias by 44.11% (from 2.63 ppm to 1.47 ppm) compared to using a static prior. The emulator itself exhibits high performance, with an R² of 0.71 and an RMSE of 2.13 ppm, in agreement with the Carbon Tracker data. Ultimately, this work presents a robust and computationally efficient solution that resolves the conflict between the accuracy and timeliness of a priori constraints, effectively translating the performance of a delayed assimilation system into a real-time retrieval framework to significantly enhance the reliability of satellite CO₂ monitoring.

Keywords: ACGS/TanSat; XCO₂ retrieval algorithm; a priori profiles; random forest

1. Introduction

Greenhouse gases (GHGs) are major contributors to global warming and the increased frequency of extreme weather events [1]. The yearly increase in the atmospheric concentration of carbon dioxide (CO₂), the most important GHG, has attracted global attention [2]. According to the WMO Greenhouse Gas Bulletin released by the World Meteorological Organization (WMO) on 28 October 2024, the globally averaged surface concentration of CO₂ reached 420.0 parts per million (ppm) in 2023, which is 1.51 times the pre-industrial (1750) level [3]. Consequently, the accurate measurement of the global column-averaged dry-air mole fraction of CO₂ (XCO₂) is crucial in monitoring carbon sources and sinks, understanding the carbon cycle, and validating carbon emissions. While ground-based observation networks can provide high-precision continuous measurements, the uneven global distribution of sites means that they cannot meet the requirements of carbon cycle monitoring. In contrast, satellite remote sensing, with its extensive global coverage and high sampling frequency, has become an indispensable tool for climate research [4].

The capability to monitor XCO₂ from space has evolved significantly over the past two decades. Early instruments, such as the SCIAMACHY sensor on ENVISAT, pioneered this field but offered limited accuracy owing to their lower spectral resolution and signal-to-noise ratios [5]. In response, a new generation of dedicated satellites was developed to meet the more stringent accuracy requirements. Key missions include Japan’s Greenhouse gases Observing SATellite (GOSAT) series [6,7,8,9], the United States’ Orbiting Carbon Observatory (OCO-2/3) missions [10,11], and China’s first carbon monitoring satellite, TanSat [12,13], which provides the primary observational data for this study. These advanced platforms have enabled the scientific community to access an unprecedented volume of high-quality spectral data, forming the foundation for modern atmospheric retrieval studies.

Retrieval algorithms applied to satellite data are generally divided into two main categories: statistical and physical methods [14]. While statistical methods have been successfully applied to retrieve CO₂ concentrations, with studies employing various techniques, from neural networks to more recent Transformer-based models [15,16,17], these approaches as a whole face limitations in areas like error characterization for data assimilation [14]. Consequently, the application of statistical methods is often focused on generating input parameters for the more dominant physical algorithms [13]. The physical approach, which forms the foundation of most operational retrieval systems, primarily includes two types of methods: differential optical absorption spectroscopy (DOAS), and, particularly for modern high-resolution sensors, full physics (FP) algorithms based on the optimal estimation method (OEM). A summary of the main physical retrieval algorithms and their properties, as developed by various international research groups, is presented in Table 1. While these physical algorithms are based on similar theoretical principles, their practical implementations reveal diverse strategies and a common challenge regarding the selection of the crucial a priori CO₂ profile. This is particularly evident in the handling of profile timeliness. For example, a notable study by Wu et al. [18] for the 2014–2016 period used the RemoTeC-FP algorithm with Carbon Tracker profiles from 2013, demonstrating a reliance on historically lagged data. Similarly, the developers of the IAPCAS algorithm, while producing high-accuracy results, initially adopted profiles based on the CT2013 model, before later transitioning to other non-concurrent model outputs like the LMDZ MACC-II model in subsequent work [19,20,21].

A closer examination of the a priori profiles used in these algorithms reveals a common reliance on data that are not concurrent with the satellite measurements. To enhance the retrieval accuracy, it is essential to use real-time prior CO₂ distributions, as these profiles serve as the initial estimates and constraints to improve the stability and accuracy of the results. In practice, however, researchers often adopt one of several strategies to approximate this ideal: some studies rely on static, climatological profiles for consistency, while others utilize outputs from complex chemical transport models or data assimilation systems like Carbon Tracker, which inherently have a significant time delay due to the extensive data collection and processing required for their production. This inherent temporal mismatch between any of these a priori constraints and the actual, dynamic atmospheric state at the moment of observation is a primary source of systematic bias, fundamentally limiting the reliability of the final XCO₂ product. Therefore, developing a method to generate accurate, observation-specific a priori profiles that can leverage a full suite of concurrently available data remains a critical and necessary step to advance this field.

To address the aforementioned challenge of non-concurrent a priori profiles, this study develops and evaluates a novel machine learning approach for the generation of real-time, observation-specific CO₂ priors. This work begins by developing a robust and efficient random forest-based CO₂ profile emulator (RF-CPE) that leverages its ability to handle the complex, non-linear relationships between multisource data. The study then applies these RF-CPE-generated profiles as dynamic, a priori constraints within a full physics retrieval framework for TanSat observations. The resulting XCO₂ product is subsequently validated through a dual framework, which first involves the rigorous validation of its numerical accuracy against ground-based TCCON measurements. As a final step, the scientific coherence of the product is further assessed by evaluating its capability to capture well-understood regional and seasonal carbon cycle dynamics in a case study over China and its neighboring regions.

This paper is organized as follows. Section 2 describes the datasets and the detailed methodology for the RF-CPE model’s development and the XCO₂ retrieval framework. Section 3 presents and discusses the results, beginning with an evaluation of the emulator’s performance, followed by the comprehensive validation of the final XCO₂ product against TCCON data and an analysis of its scientific coherence through a regional case study. Finally, Section 4 summarizes the key findings and provides the conclusions of this study.

2. Materials and Methods

2.1. Data

2.1.1. ACGS/TanSat Observations

The primary satellite data for this study were acquired from the Atmospheric Carbon Dioxide Grating Spectroradiometer (ACGS), the main payload on the TanSat satellite. The ACGS instrument is a hyperspectral spectrometer that measures solar backscattered radiation in three specific bands to retrieve CO₂: the O₂-A band around 0.76 µm, a weak CO₂ absorption band at 1.61 µm, and a strong CO₂ absorption band at 2.06 µm, with a spectral resolution of up to 0.04 nm.

For this work, we used the Level 1B Nadir-mode scientific data product (V2.0), which covers the period from 2017 to 2018 and was downloaded from the National Satellite Meteorological Center’s data service website (https://satellite.nsmc.org.cn/ (accessed on 5 February 2024)) [32]. However, the dataset for this period contains several gaps. For instance, data are missing for January and February 2017, and a significant loss of observations occurred in October 2018 when data processing was suspended for parameter adjustments. Furthermore, a partial loss of data was experienced in December 2018, which was attributed to the onset of ACGS sensor degradation.

2.1.2. ERA5/ECMWF

ERA5 is a meteorological reanalysis dataset produced by the European Centre for Medium-Range Weather Forecasts (ECMWF) that is widely used in many fields. The data can be downloaded from the official website of the ECMWF (https://www.ecmwf.int/ (accessed on 22 January 2024)). ERA5 is the fifth generation of the ECMWF reanalysis of global climate and weather, covering the past eight decades. ERA5 uses data assimilation techniques to integrate model simulation data with observations from around the world into a complete and consistent dataset [33,34]. ERA5 is updated daily with a delay of about 5 days. Its horizontal resolution is 0.25° × 0.25°, and its 37 pressure levels span 1000 hPa to 1 hPa in the vertical; the temporal resolution is hourly [35].

This study used hourly ERA5 meteorological data for 2017 and 2018, mainly the temperature, humidity, total column water, and the u and v wind components.

2.1.3. MODIS/Aqua Data

Ancillary data from the Moderate-Resolution Imaging Spectroradiometer (MODIS) [36,37], an instrument on NASA’s Terra and Aqua satellites, were used in this study. All MODIS scientific products were sourced from the Aqua satellite and downloaded from the LAADS DAAC website (https://ladsweb.modaps.eosdis.nasa.gov/ (accessed on 18 February 2024)). These data products are widely used in meteorological and climate research and serve as an important tool for global monitoring [38]. Four specific MODIS scientific products were utilized to provide essential information for cloud screening, atmospheric states, and surface properties.

For the purpose of cloud screening, the CLDMSK_L2_MODIS_Aqua product was utilized. This cloud mask product, generated from version 6.1 of the MODIS sensor’s L1B data, provides cloud information for each pixel with strict quality control. In this study, it was used to filter the TanSat observation data, thereby eliminating data affected by clouds and preventing the contamination of the retrieval results.

Additionally, atmospheric aerosol information was provided by the Aerosol Optical Depth (AOD) product, MYD04_L2. The AOD is the vertical integral of the aerosol extinction coefficient and is the most widely used aerosol remote sensing parameter [39,40]. The MYD04_L2 product, with a spatial resolution of 10 km, was used in this study to provide the atmospheric AOD during the CO₂ retrieval process and as a key input feature for the construction of the RF-CPE model.

To characterize the properties of the land surface, two further MODIS products were incorporated as inputs for the RF-CPE model. The first was the Normalized Difference Vegetation Index (NDVI), a metric indicating the state and health of terrestrial ecosystems. This was obtained from the MYD13A3 dataset, which has a spatial resolution of 1 km and a 16-day product interval reflecting seasonal and interannual vegetation changes. The second surface product was reflectance, which was provided by the high-quality MYD09GA dataset at a 500 m spatial resolution.

2.1.4. Carbon Tracker Profile

Carbon Tracker, developed by the National Oceanic and Atmospheric Administration (NOAA), is a global data assimilation system that estimates carbon cycle dynamics. It combines atmospheric observations with model simulations to infer CO₂ sources and distributions [41]. The system utilizes a two-level nested grid, which consists of a global grid with a resolution of 3° longitude by 2° latitude and a finer 1° × 1° grid over North America. Within this framework, it provides atmospheric CO₂ profile data across 34 vertical layers for 8 time periods every day [42].

In this study, the Carbon Tracker data product for the 2017–2018 period served two critical functions. Primarily, it was used as the target variable to train the random forest-based emulator model. Additionally, these data were used to calculate the a priori error covariance matrix required for the full physics retrieval algorithm.

2.1.5. TCCON Measurements

Validation is an important part of remote sensing retrieval and the only means to evaluate the quality, reliability, and applicability of remote sensing products [43]. For this purpose, this study used Total Carbon Column Observing Network (TCCON) data as ground-based validation data, downloaded from the TCCON website (https://tccondata.org/ (accessed on 3 March 2024)). The TCCON uses high-precision ground-based Fourier transform spectrometers (FTSs) to measure solar spectral radiation in the range of 4000–9000 cm⁻¹. Compared with satellite retrievals, TCCON retrievals are more accurate and can effectively avoid errors caused by aerosols, cirrus clouds, etc. The column concentrations of CO₂, CH₄, N₂O, HF, CO, and H₂O retrieved from these spectra are of high precision [44]. Therefore, the TCCON XCO₂ data product has been recognized as the validation standard for satellite remote sensing data and is widely used to validate the accuracy of satellite data. Taking into account data coverage and completeness for the study period, observation data from 14 TCCON sites were selected as validation data for our retrieval results. Detailed information for these sites, which was obtained from the official TCCON data portal, is provided in Table 2.

2.2. RF-CPE Model Development

2.2.1. Sensitivity Analysis and Feature Selection

Sensitivity analysis is a critical step in satellite remote sensing retrieval, serving the dual purpose of identifying key variables that influence the retrieval results and informing the selection of optimal input features for a machine learning model to mitigate the risk of overfitting [45,46]. This study focuses its analysis on the weak CO₂ absorption band centered around 1.61 µm, which serves as the primary source of information in retrieving the total column abundance of CO₂. A key advantage of this spectral region is the high signal-to-noise ratio (SNR), as the combination of strong solar radiance and relatively low absorption from interfering gases ensures that a high signal level is available to the sensor. The term “weak”, used to describe the CO₂ absorption features in this band, is relative, and a critical advantage of these non-saturating features is that they allow the band to be sensitive to CO₂ concentration changes throughout the entire atmospheric column, particularly in the lower troposphere, where most variability occurs. In contrast, the other two bands measured by the ACGS instrument serve complementary roles in the retrieval process. The O₂ A band around 0.76 µm is primarily used to retrieve information about the atmospheric light path length, aerosols, and clouds, while the strong CO₂ absorption band around 2.06 µm offers further constraints on aerosol properties [47,48]. Consequently, a single-factor sensitivity analysis was conducted to evaluate how changes in aerosol optical depth (AOD), surface reflectance, temperature, and water vapor impact the simulated radiance of this specific band. To perform this analysis, this study utilized the SCIATRAN [49] radiative transfer model (v4.6.1) to simulate the top-of-atmosphere spectra under various defined scenarios.

Atmospheric aerosol scattering is a primary source of error in XCO₂ retrievals from reflectance spectra, as high aerosol loads can alter the atmospheric optical path length, which complicates radiative transfer modeling and introduces significant bias into the retrieved XCO₂ [50]. Given the complex and highly variable nature of aerosols, we first simulated the radiance spectra of the weak CO₂ absorption band under six different AOD scenarios, with values ranging from 0.1 to 0.6, to quantify this effect. As shown in Figure 1, an increase in AOD leads to a corresponding decrease in the simulated spectral radiance. The results indicate that, for every 0.1 increase in AOD, the radiance in the weak CO₂ band decreases by approximately 2.5%.

Surface reflectance is another significant source of error in the XCO₂ retrieval process, primarily due to the complex bidirectional reflectance distribution function (BRDF) [51] characteristics of natural surfaces. To assess the impact of this variability, a range of surface reflectance values from 10% to 50% was selected for the sensitivity analysis. This range was designed to be representative of the typical reflectance values of most terrestrial surfaces observed by satellites like TanSat, encompassing darker vegetated areas as well as brighter, arid, or sparsely vegetated landscapes. Using these values, the radiance spectra in the weak CO₂ absorption band were simulated, and the results are presented in Figure 2. The results clearly demonstrate that surface reflectance has a significant effect on the spectral radiance. The simulated radiance at a surface reflectance level of 50% is about three times that at 15%, highlighting the necessity of accurately characterizing the surface in the retrieval algorithm.

The temperature directly affects the strength, shape, and even the positions of CO₂ molecular absorption lines [52]. Furthermore, the atmospheric temperature varies significantly with the geolocation, season, and weather, thus impacting the CO₂ absorption spectra observed by satellites [53]. Given this dual influence, quantifying the impact of temperature uncertainty on the observed radiance is essential. Therefore, this study conducted a sensitivity analysis using the 1976 U.S. Standard Atmosphere (USSA-1976) as the baseline temperature profile. We then simulated the spectral radiance under several perturbation scenarios by uniformly altering the entire temperature profile by ±1 K, ±2 K, and ±3 K. The results are shown in Figure 3, where the radiance change rate is defined as the percentage difference between the spectrum simulated with a perturbed profile and the spectrum simulated with the baseline profile. The results indicate a band-dependent effect on the radiance. Notably, within the sub-regions of this band where CO₂ absorption is relatively strong, a temperature perturbation of just a few kelvin can alter the simulated radiance by up to 4%.
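As a minimal sketch of the radiance change rate defined above, the following Python snippet computes the percentage difference at each spectral point. The function name, the toy radiance values, and the choice of the baseline spectrum as the denominator are illustrative assumptions and are not taken from the SCIATRAN configuration used in this study.

```python
def radiance_change_rate(perturbed, baseline):
    """Percentage difference of a perturbed spectrum relative to the baseline.

    Assumption: the baseline radiance is used as the denominator.
    """
    if len(perturbed) != len(baseline):
        raise ValueError("spectra must share the same spectral grid")
    return [100.0 * (p - b) / b for p, b in zip(perturbed, baseline)]


# Toy radiances (arbitrary units) at three spectral points; values are made up.
baseline = [1.00, 0.80, 0.50]
perturbed = [1.00, 0.82, 0.52]
rates = radiance_change_rate(perturbed, baseline)
```

With these toy values, the point with the lowest baseline radiance shows the largest relative change, which mirrors the observation that the strongly absorbing sub-regions of the band respond most to a perturbation.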

Finally, a sensitivity analysis for water vapor was performed. Although water vapor absorption is relatively weak in the 1.61 µm spectral region, its high concentration and variability in the atmospheric boundary layer necessitate an evaluation of its impact on XCO₂ retrieval. Following the same methodology, the 1976 U.S. Standard Atmosphere was used as the baseline, and simulations were run with water vapor profile perturbations of ±5%, ±10%, and ±15%. As illustrated in Figure 4, the simulated spectral radiance decreases as the water vapor content increases. The results show that a 15% perturbation in water vapor content can lead to a radiance change of approximately 2% in some parts of the band.

The final selection of input features for the RF-CPE model was based on a comprehensive strategy that combined a physics-based sensitivity analysis with the inclusion of fundamental physical drivers of CO₂ variability. The preceding sensitivity analysis confirmed that key parameters known to be significant sources of error in physical retrievals, such as the AOD, surface reflectance, temperature, and total column water vapor, had a substantial and direct impact on the simulated radiance of the weak CO₂ band and were therefore included. To provide the model with a broader geophysical context, additional features were selected based on their well-established roles as drivers of CO₂ concentrations and transport. These included meteorological characteristics such as the surface pressure, humidity, and wind vectors, as well as the Normalized Difference Vegetation Index (NDVI) as a proxy for biospheric activity. Finally, the TanSat radiance spectra in the weak CO₂ absorption band were included as the most direct observational feature.

This comprehensive set of ten features, detailed in Table 3, was designed to provide the model with a holistic view of both the radiative transfer process and the underlying geophysical state, mitigating the risk of overfitting by ensuring that all selected features had a strong physical basis.

2.2.2. Emulator Training and Feature Importance

In this study, the CO₂ profile emulator (RF-CPE) was developed using the random forest (RF) algorithm. RF is a machine learning method based on the idea of ensemble learning. Building on decision trees, it improves the performance and stability of the overall model by combining multiple tree outputs. The RF method was specifically chosen for this geophysical application due to a combination of its performance, robustness, and interpretability. It excels in capturing the complex, non-linear relationships between diverse input features (i.e., satellite spectra, meteorological data, and surface properties) and the target CO₂ profiles. Furthermore, it is generally robust against overfitting and less sensitive to the scale of input features, reducing the need for extensive pre-processing. A final, crucial advantage is its inherent ability to calculate feature importance, providing valuable scientific insights into the physical drivers learned by the model.

The RF-CPE was implemented as a regression model, using the high-dimensional feature data presented in Table 3 as predictors and the Carbon Tracker profile product as the target variable. A multifaceted strategy was employed to ensure the model’s robustness and mitigate the risk of overfitting. A primary measure was the feature selection process itself, which, as detailed previously, was guided by a physics-based sensitivity analysis to ensure a strong physical basis for all inputs. The choice of the random forest algorithm also provided inherent protection, as its ensemble learning nature is designed to improve generalization. Specifically, the model constructs each decision tree from a bootstrapped subsample of the training data (bagging) and considers only a random subset of features at each split [54]. This dual-randomization process prevents individual trees from overrelying on specific noise patterns, and, because the final prediction is the average of all tree outputs, random errors are effectively cancelled out. Furthermore, direct regularization was applied during training by constraining key hyperparameters, such as limiting the maximum depth of each tree to 15 levels. As a final measure, the complete dataset was partitioned into a training set (80%) and a hold-out testing set (20%) for an independent evaluation of the model’s generalization ability.

After training, an analysis of the feature importance rankings was conducted (Figure 5), revealing that the CO₂ absorption spectrum, wind speed, and humidity were the top three most influential features. This finding confirms that the spectral information is the model’s strongest single driver, while also indicating that meteorological factors governing regional CO₂ transport play a critical role. Once trained, these learned relationships allow the RF-CPE model to generate real-time CO₂ profiles for new soundings, which then serve as the dynamic a priori constraints for the subsequent full physics retrieval.
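The training setup described in this subsection can be sketched roughly as follows, assuming scikit-learn is available. Only the 80/20 split and the maximum tree depth of 15 come from the text; the synthetic data, the number of trees, the `max_features` setting, and all variable names are assumptions made for illustration, so this is not the configuration actually used for the RF-CPE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 10 predictors per sounding and a
# 34-layer target profile, mimicking Table 3 and the Carbon Tracker layers.
n_samples, n_features, n_layers = 500, 10, 34
X = rng.normal(size=(n_samples, n_features))
weights = rng.normal(size=(n_features, n_layers))
y = 400.0 + X @ weights + 0.1 * rng.normal(size=(n_samples, n_layers))

# 80% training / 20% hold-out testing, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# max_depth=15 follows the text; n_estimators and max_features are assumed.
model = RandomForestRegressor(
    n_estimators=50, max_depth=15, max_features="sqrt", random_state=0
)
model.fit(X_train, y_train)

profiles = model.predict(X_test)          # shape: (n_test, n_layers)
score = r2_score(y_test, profiles)        # R2, uniformly averaged over layers
importances = model.feature_importances_  # one value per feature, sums to 1
ranking = np.argsort(importances)[::-1]   # most influential feature first
```

Fitting a single multi-output forest keeps all profile layers in one model; training one forest per layer would be an equally plausible design, and the text does not state which was used.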

2.3. XCO₂ Retrieval Framework

2.3.1. Full Physics Retrieval Algorithm

Once the RF-CPE model is trained, it can generate dynamic, real-time, a priori CO₂ profiles for each TanSat sounding. These emulated profiles are then integrated into the physical retrieval algorithm, serving both as an input to the forward model and as the a priori constraint for the retrieval. The overall workflow of this integrated framework is illustrated in Figure 6. The framework consists of two main components: the right panel of the figure details the development of the RF-CPE model, a statistical process that begins with multisource data input, followed by sensitivity analysis and feature selection to train the random forest algorithm; the left panel shows how the output from this trained emulator, which is the real-time CO₂ profile, is then used as a dynamic a priori profile to initialize and constrain the iterative physical retrieval loop.

The forward model produces a simulated spectrum from the input state vector through high-precision radiative transfer calculations. This study selected the SCIATRAN model v4.6.1 as the forward model to simulate atmospheric radiative transfer and mainly used the int and wf modes of the model to calculate the simulated spectrum and the Jacobian matrix of related parameters.

The retrieval algorithm used in this study is the optimal estimation method proposed by Rodgers [55]. It uses prior information as a constraint in the retrieval process and completes the atmospheric CO₂ retrieval based on Bayesian maximum a posteriori estimation. The relationship between the TanSat observed spectrum $y$ and the state vector $x$ to be retrieved can be expressed as

$$y = F(x, b) + \epsilon \tag{1}$$

where $y$ represents the radiance spectrum of the weak CO₂ absorption band; $F(x, b)$, written as $F(x)$ hereafter, represents the forward model; $x$ is the state vector; $b$ represents the auxiliary parameters; and $\epsilon$ represents the instrument noise and the error of the forward model. Since $F(x)$ is a non-linear function, solving for $x$ amounts to solving a Fredholm integral equation of the first kind, which is ill-posed, so $x$ can only be accurately retrieved under certain prior knowledge and constraints.

Retrieving the state vector under the prior constraints is equivalent to minimizing a cost function. The cost function $\chi^{2}$ combines the difference between the observed spectrum $y$ and the simulated spectrum $F(x)$ with the difference between the retrieved state vector $x$ and the prior state vector $x_{a}$. It can be defined as

$$\chi^{2} = (F(x) - y)^{T} S_{\epsilon}^{-1} (F(x) - y) + (x - x_{a})^{T} S_{a}^{-1} (x - x_{a}) \tag{2}$$

where $S_{\epsilon}$ is the measurement error covariance matrix and $S_{a}$ is the a priori error covariance matrix. The superscripts $T$ and $-1$ represent the transpose and inverse of the matrix, respectively.

The measurement error covariance matrix $S_{\epsilon}$ describes the satellite observation error [56]. Satellite observation errors mainly come from uncertainties in the satellite observation system, atmospheric model, and data processing. Therefore, the measurement error covariance matrix mainly includes instrument noise, systematic errors, and forward model errors. It is generally simplified to a diagonal matrix during calculation. If the measurement errors of each band of the instrument are assumed to be independent of each other and proportional to the measured values of the corresponding bands, the measurement error covariance matrix can be expressed as

$$S_{\epsilon} = \sigma_{y}^{2} V_{\epsilon} \tag{3}$$

where $V_{\epsilon}$ is the identity matrix and $\sigma_{y}^{2}$ is the measurement variance of each band, which is defined as $(y / SF)^{2}$ in this study. $SF$ is a proportionality factor that lumps together the factors that may cause measurement errors, such as instrument noise, atmospheric interference, and weak signals. In an ideal laboratory environment, its physical meaning is similar to the instrument signal-to-noise ratio.
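Equations (2) and (3) can be illustrated with a small NumPy sketch. The function names, the toy spectra, the value SF = 100, and the 2 × 2 prior covariance are assumptions chosen so that the arithmetic is easy to follow; they are not values from this study.

```python
import numpy as np


def measurement_covariance(y, sf):
    """Diagonal S_eps with one variance (y / SF)^2 per element of y, as in Eq. (3)."""
    return np.diag((y / sf) ** 2)


def cost(y, f_x, x, x_a, s_eps, s_a):
    """Cost function chi^2 of Eq. (2): spectral misfit plus departure from the prior."""
    r = f_x - y
    d = x - x_a
    return float(r @ np.linalg.solve(s_eps, r) + d @ np.linalg.solve(s_a, d))


# Toy numbers (all assumed): 3 spectral points, 2 state elements, SF = 100.
y = np.array([1.0, 2.0, 4.0])        # observed radiances
f_x = np.array([1.01, 2.0, 3.96])    # simulated radiances F(x)
x = np.array([401.0, 399.0])         # current state (ppm)
x_a = np.array([400.0, 400.0])       # prior state (ppm)
s_eps = measurement_covariance(y, sf=100.0)
s_a = np.diag([4.0, 1.0])
chi2 = cost(y, f_x, x, x_a, s_eps, s_a)
```

Here each spectral residual of 1% of the signal contributes exactly 1 to the misfit term (since SF = 100), giving 2 in total, and the prior term contributes 1/4 + 1 = 1.25, so `chi2` is approximately 3.25.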

The a priori error covariance matrix $S_{a}$ expresses the prior knowledge of the CO₂ concentration distribution. It describes the uncertainty of the CO₂ concentration at different altitude levels. Its diagonal elements are variances, which correspond to the expected ranges of variation of each parameter in the state vector; the off-diagonal elements are covariances, which correspond to the degree of correlation between the parameters. This study used Carbon Tracker profile data to calculate the a priori error covariance matrix. Considering that the local time of the ascending node of TanSat is 13:30, the Carbon Tracker profile data at 13:30 were selected to calculate the covariance matrix. Assuming that $n$ profile samples are selected, the covariance matrix $X_{cov}$ can be expressed as

$$X_{cov} = \begin{bmatrix} x_{1,1} & \dots & x_{1,p} \\ \vdots & \ddots & \vdots \\ x_{p,1} & \dots & x_{p,p} \end{bmatrix} \tag{4}$$

where the element $x_{i,j}$ of the covariance matrix is

$$x_{i,j} = \frac{1}{n} \sum_{k=1}^{n} (x_{k,i} - \bar{x}_{i}) (x_{k,j} - \bar{x}_{j}) \tag{5}$$

where $n$ represents the number of profile samples, $x_{k,i}$ represents the concentration value of the $i$-th layer of the $k$-th profile, $\bar{x}_{i}$ represents the mean concentration of the $i$-th layer, the subscripts $i$ and $j$ are layer indices, and $p$ is the total number of profile layers.
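The covariance calculation of Equation (5) can be sketched in a few lines of NumPy. The function name and the three two-layer profiles are hypothetical and serve only to make the 1/n normalisation concrete.

```python
import numpy as np


def prior_covariance(profiles):
    """Layer-by-layer covariance of n profiles (rows) over p layers (columns).

    Uses the 1/n normalisation of Eq. (5).
    """
    profiles = np.asarray(profiles, dtype=float)
    anomalies = profiles - profiles.mean(axis=0)  # subtract the layer means
    return anomalies.T @ anomalies / profiles.shape[0]


# Three made-up profiles (ppm) on two layers, purely for illustration.
profiles = np.array([[400.0, 398.0], [402.0, 399.0], [404.0, 400.0]])
s_a = prior_covariance(profiles)
```

The result is a symmetric p × p matrix and matches `np.cov(profiles, rowvar=False, bias=True)`, which applies the same 1/n normalisation.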

Considering that the retrieval problem is non-linear, iteration is required to find the state vector that satisfies the prior constraints and minimizes the cost function. Since the traditional means of solving this problem has many shortcomings, this study used the Levenberg–Marquardt method to minimize the cost function. The state vector updated in each iteration is calculated as

[(1 + γ) S_{a}^{-1} + K_{i}^{T} S_{ε}^{-1} K_{i}] dx_{i+1} = K_{i}^{T} S_{ε}^{-1} [y - F(x_{i})] + S_{a}^{-1} (x_{a} - x_{i})

(6)

where γ represents the Levenberg–Marquardt parameter, i represents the iteration index, and K is the weighting function matrix (Jacobian matrix). The Jacobian matrix K is the linear approximation of a multivariable vector-valued function with respect to its variables and is composed of all first-order partial derivatives of the function. It can be expressed as

K = \begin{bmatrix} \frac{\partial y_{1}}{\partial x_{1}} & \dots & \frac{\partial y_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_{m}}{\partial x_{1}} & \dots & \frac{\partial y_{m}}{\partial x_{n}} \end{bmatrix}

(7)

where y represents the radiance of each band, x represents the CO₂ concentration of each layer of the atmosphere, m represents the number of spectral points, and n represents the number of layers of the atmospheric profile.
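The update in Equation (6) amounts to solving one linear system per iteration. The sketch below is a minimal illustration, not the authors' implementation: the function name `lm_step`, the toy linear forward model `K @ x`, and all numerical values are assumptions chosen so that the behavior can be checked by hand.

```python
import numpy as np

def lm_step(x_i, x_a, y, f_xi, K, S_a_inv, S_eps_inv, gamma):
    """Solve Eq. (6) for the state increment dx_{i+1}."""
    lhs = (1.0 + gamma) * S_a_inv + K.T @ S_eps_inv @ K
    rhs = K.T @ S_eps_inv @ (y - f_xi) + S_a_inv @ (x_a - x_i)
    return np.linalg.solve(lhs, rhs)

# Toy setup (assumed): m = 6 spectral points, n = 3 layers, linear forward model.
rng = np.random.default_rng(1)
K = rng.normal(size=(6, 3))
x_true = np.array([410.0, 408.0, 405.0])
x_a = np.full(3, 400.0)
y = K @ x_true
S_a_inv = np.eye(3) / 4.0       # prior variance of 4 ppm^2 per layer
S_eps_inv = np.eye(6) / 0.01    # measurement noise variance of 0.01

# Undamped step (gamma = 0) starting from the prior.
dx_undamped = lm_step(x_a, x_a, y, K @ x_a, K, S_a_inv, S_eps_inv, gamma=0.0)
x_1 = x_a + dx_undamped

# A second step from x_1 should vanish for a linear forward model.
dx_second = lm_step(x_1, x_a, y, K @ x_1, K, S_a_inv, S_eps_inv, gamma=0.0)

# A larger gamma damps the step.
dx_damped = lm_step(x_a, x_a, y, K @ x_a, K, S_a_inv, S_eps_inv, gamma=10.0)
```

For a linear forward model the undamped step reaches the minimum of the cost function in one iteration, so the second increment is numerically zero. In the non-linear case, γ is typically increased when a step fails to reduce the cost function and decreased otherwise.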

2.3.2. Validation Strategy

The performance and scientific validity of the final XCO₂ product were rigorously evaluated through a dual validation framework, consisting of a quantitative accuracy assessment against TCCON data and a qualitative analysis of the retrieved spatiotemporal patterns.

To quantify the numerical accuracy, the retrieval results were first validated against ground-based measurements from the TCCON. This quantitative validation was conducted using two distinct spatiotemporal collocation criteria. A stringent criterion (±0.5° and ±1 h) was used for a controlled parallel experiment to directly compare the performance of the RF-CPE-based priors against conventional static priors. Subsequently, a relaxed criterion (±5° and ±1 h) was employed to increase the sample size for a more comprehensive performance analysis of the RF-CPE optimized algorithm.
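The two collocation criteria can be sketched as a simple mask over the satellite soundings. This is an illustration under stated assumptions: the ±0.5° and ±5° limits are interpreted here as a latitude/longitude box, and the function name `collocation_mask`, the site coordinates, and the sounding values are hypothetical.

```python
import numpy as np

def collocation_mask(lat, lon, time, site_lat, site_lon, site_time,
                     max_deg, max_hours=1.0):
    """True where a sounding lies within ±max_deg and ±max_hours of a site measurement."""
    dlat = np.abs(lat - site_lat)
    # Wrap longitude differences into [-180, 180) so the box works across the date line.
    dlon = np.abs((lon - site_lon + 180.0) % 360.0 - 180.0)
    dt_hours = np.abs((time - site_time) / np.timedelta64(1, "h"))
    return (dlat <= max_deg) & (dlon <= max_deg) & (dt_hours <= max_hours)

# Hypothetical soundings and one hypothetical ground-based measurement.
lat = np.array([34.2, 34.5, 38.0, 34.1])
lon = np.array([-118.1, -118.3, -118.0, -118.2])
time = np.array(
    ["2017-07-01T21:10", "2017-07-01T21:20", "2017-07-01T21:15", "2017-07-02T03:00"],
    dtype="datetime64[m]",
)
site_lat, site_lon = 34.136, -118.127
site_time = np.datetime64("2017-07-01T21:00")

strict = collocation_mask(lat, lon, time, site_lat, site_lon, site_time, max_deg=0.5)
relaxed = collocation_mask(lat, lon, time, site_lat, site_lon, site_time, max_deg=5.0)
```

In this toy case the third sounding is rejected by the stringent criterion but accepted by the relaxed one, while the fourth fails the ±1 h limit under both.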

Second, to assess the scientific coherence of the retrieved data, a case study was performed by analyzing the spatiotemporal distribution of XCO₂ over China and its neighboring regions. This qualitative validation aimed to evaluate the ability of the globally applied retrieval framework to capture well-understood regional and seasonal carbon cycle dynamics.

3. Results and Discussion

3.1. RF-CPE Emulator Performance and Analysis

Before its application in the retrieval framework, the performance of the trained RF-CPE model was rigorously evaluated by comparing its output against the held-out test set from the Carbon Tracker product. This altitude-resolved validation, presented in Figure 7, assessed the model’s ability to capture the vertical distribution of CO₂. The results reveal a clear distinction in performance between different atmospheric levels. In the upper atmosphere, where CO₂ is well mixed and less variable, the emulated profiles closely match the Carbon Tracker reference, with both the RMSE and standard deviation (Std) remaining around 1 ppm. In contrast, the model exhibits substantially larger uncertainty in the near-surface layers, where the RMSE and Std increase to approximately 4.5 ppm. This discrepancy is attributable to the greater spatiotemporal variability in CO₂ in the boundary layer, which is heavily influenced by complex interactions with local sources and sinks and dynamic meteorological conditions, making it inherently more challenging to model. To further assess the model’s reliability in these critical lower layers, a detailed comparison for the four lowest atmospheric levels is presented in Figure 8. These plots indicate that, while there is a slight tendency for underestimation in the near-surface layers, the fitting errors remain within an acceptable range. Summarizing the performance across all atmospheric levels, the RF-CPE model achieves a total R² of 0.71 and an overall RMSE of 2.13 ppm against the test set, demonstrating its general reliability for use in the retrieval framework.
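The altitude-resolved metrics discussed above can be computed along the following lines. The sketch assumes that Std denotes the standard deviation of the emulated-minus-reference differences and that R² is the coefficient of determination over all levels pooled; the text does not spell out either definition, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def per_level_errors(emulated, reference):
    """RMSE and standard deviation of the error for each level.

    Both inputs have shape (n_samples, n_levels), in ppm.
    """
    error = emulated - reference
    rmse = np.sqrt(np.mean(error ** 2, axis=0))
    std = np.std(error, axis=0)
    return rmse, std

def overall_scores(emulated, reference):
    """Coefficient of determination and RMSE with all levels pooled."""
    residual = np.sum((reference - emulated) ** 2)
    total = np.sum((reference - reference.mean()) ** 2)
    r2 = 1.0 - residual / total
    rmse = np.sqrt(np.mean((emulated - reference) ** 2))
    return r2, rmse

# Synthetic check: a constant +1 ppm offset gives RMSE = 1 and Std = 0 at every level.
rng = np.random.default_rng(0)
reference = 405.0 + rng.normal(scale=3.0, size=(1000, 5))
emulated = reference + 1.0
rmse, std = per_level_errors(emulated, reference)
r2, total_rmse = overall_scores(emulated, reference)
```

Reporting RMSE and Std side by side separates a systematic offset (RMSE larger than Std) from random scatter (RMSE close to Std).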

In addition to the accuracy, the operational feasibility of the RF-CPE model was evaluated in terms of its computational cost. The model training was a one-time process performed on a computer equipped with at least 16 GB of RAM and an NVIDIA RTX 3060 GPU or higher. Using the complete 2017–2018 dataset, the training phase required approximately 3 h to complete. However, the key advantage of the emulator approach is that this intensive training is decoupled from the time-critical retrieval process. Once trained, the RF-CPE model is saved and can be loaded for direct application.

Subsequently, generating a new CO₂ profile for a single TanSat sounding is highly efficient. While the exact processing time depends on the hardware configuration and the implementation of parallel computing, it is negligible compared to the time required for the forward model to perform a single radiative transfer calculation within the retrieval loop. This computational efficiency confirms that the RF-CPE is well suited to operational processing chains, as it not only provides a real-time a priori profile but also has the potential to significantly accelerate the overall retrieval process.

3.2. Improvement in XCO₂ Retrieval Accuracy

The results of the controlled parallel experiment, validated against 78 collocated TCCON soundings using the stringent criteria, are presented in Figure 9. This figure, which plots the retrieval biases of both the RF-CPE-based and static-prior approaches, reveals a substantial improvement with the new method. Retrievals using the conventional static prior exhibited a large mean bias of 2.63 ppm against the TCCON. In contrast, retrievals incorporating the dynamic RF-CPE profiles achieved a significantly lower mean bias of 1.47 ppm. This represents a 44.11% reduction in systematic bias, clearly demonstrating that the real-time, observation-specific priors provide a more effective constraint for the physical retrieval algorithm and substantially enhance the accuracy of the final XCO₂ product.
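The reported reduction follows directly from the two mean biases quoted above:

```latex
\frac{2.63 - 1.47}{2.63} \times 100\% = \frac{1.16}{2.63} \times 100\% \approx 44.11\%
```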

3.3. Validation Against TCCON and Analysis of Regional Biases

To evaluate the overall performance of the RF-CPE-based retrieval framework with a larger dataset, validation was performed using the relaxed collocation criteria, which yielded 761 matched soundings between the satellite retrievals and the selected TCCON sites. The results of this global validation, grouped by continent, are summarized in Table 4. While the retrievals show strong overall performance, with a total R² of 0.76 and an RMSE of 1.99 ppm, the continental breakdown reveals clear differences in accuracy. Oceania exhibited the highest accuracy, with an RMSE of 0.77 ppm, followed by Asia at 1.85 ppm, whereas North America and Europe showed the largest errors, with RMSEs of 2.26 ppm and 2.06 ppm, respectively. To illustrate the sources of these intercontinental variations, a site-by-site validation is presented in Figure 10. This detailed analysis further reveals that, while the performance varies across Asian and European sites, a distinct pattern of consistent overestimation can be observed across all four North American stations.

Previous studies have shown that inappropriate neighborhood definitions can introduce representativeness bias when validating the retrieved XCO₂ against high-precision ground-based observations [57]. Such representativeness bias becomes particularly evident at the Caltech site, located in the Los Angeles metropolitan area. To quantitatively assess how surface heterogeneity may contribute to this bias, we used MODIS data [58] to quantify the fractions of different land cover types within a ±5° spatial domain centered on the Caltech site (Figure 11). The analysis reveals that the area occupied by urban and built-up land cover is relatively small compared to surrounding natural types such as open shrublands, grasslands, and barren or sparsely vegetated lands. This suggests that, although Caltech is situated in a densely urbanized environment, the broader spatial context used for satellite-based validation is dominated by non-urban, high-reflectance surfaces. Such spatial averaging introduces a specific form of representativeness bias, wherein the localized urban CO₂ signal is diluted by adjacent bright natural landscapes.
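A land cover fraction of this kind can be sketched as a count of classified grid cells inside the ±5° box. The example below is an assumption-laden illustration: the function name `land_cover_fractions`, the regular 1° grid, the integer class codes, and the equal weighting of cells (no correction for cell area varying with latitude) are all hypothetical and do not describe the MODIS product or the processing used in the study.

```python
import numpy as np

def land_cover_fractions(classes, lat, lon, site_lat, site_lon, half_width_deg=5.0):
    """Fraction of grid cells per class inside a lat/lon box around a site.

    classes: integer array of shape (len(lat), len(lon)); cells are equally weighted.
    """
    in_box = (
        (np.abs(lat - site_lat)[:, None] <= half_width_deg)
        & (np.abs(lon - site_lon)[None, :] <= half_width_deg)
    )
    labels, counts = np.unique(classes[in_box], return_counts=True)
    return {int(k): float(c / counts.sum()) for k, c in zip(labels, counts)}

# Synthetic 1-degree grid with illustrative class codes (assumed, not real data).
lat = np.arange(25.5, 45.0, 1.0)
lon = np.arange(-129.5, -105.0, 1.0)
classes = np.full((lat.size, lon.size), 7)   # background class
classes[:, :5] = 16                          # a strip that lies outside the box
classes[8:10, 11:13] = 13                    # a small patch around the site

fractions = land_cover_fractions(classes, lat, lon, site_lat=34.0, site_lon=-118.0)
```

In this synthetic grid the small patch around the site covers only 4 of the 100 cells in the box, which mirrors the dilution effect described above.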

Importantly, this bias not only reflects a spatial mismatch but also amplifies known retrieval artifacts. Previous OCO-2 observations over the nearby Edwards region have revealed a spurious positive dependence of the retrieved XCO₂ on surface brightness, with higher values obtained over brighter surfaces [59]. Given that many of the surrounding land types, particularly barren or sparsely vegetated areas, exhibit high surface reflectance, the incorporation of these surfaces within the ±5° validation window is likely to enhance the radiance-driven retrieval biases. Therefore, the systematic overestimation of XCO₂ at the Caltech site can be attributed to the compound effect of representativeness bias and surface reflectance-induced retrieval errors. The representativeness bias introduces high-reflectance surfaces into the retrieval footprint, while the radiance-driven bias causes these surfaces to yield spuriously elevated XCO₂ estimates. Together, these mechanisms obscure the localized urban signal and lead to the systematic inflation of the satellite-derived XCO₂ in this region.

Beyond the site-specific representativeness issues exemplified by the Caltech case, the validation results also highlight the impact of other well-understood retrieval challenges associated with difficult observation conditions. This is particularly evident at high-latitude locations like the East Trout Lake site in Canada, which tested the algorithm’s performance under very high solar zenith angles (SZAs) and high air masses. These conditions are known to reduce signal-to-noise ratios, which amplifies radiometric uncertainties, while the elongated light paths increase the sensitivity to aerosol and cloud scattering errors. A different challenge is presented by sites like Park Falls and Lamont, which, despite having relatively uniform surface properties, are subject to significant seasonal ground cover changes [60]. Such abrupt transitions in surface reflectance—for example, during snowmelt or between leaf-on and leaf-off periods—require the robust handling of the surface BRDF characteristics within the retrieval algorithm to avoid the introduction of spurious XCO₂ variations.

The detailed analysis of these North American sites demonstrates that local spatial representativeness is a key driver of retrieval uncertainty. This highlights a crucial point: while the validation provides confidence in the algorithm’s performance at these specific locations, assessing its true global applicability is inherently constrained by the sparse and uneven geographical distribution of the TCCON network itself.

This limitation means that the model’s performance in vast, undersampled regions, particularly South America and Africa, where no TCCON sites were available for this study, remains unverified. However, it is important to contextualize the confidence in the model’s general applicability. The 14 TCCON sites selected for this study, while geographically imbalanced, are not randomly placed but are instead strategically located to capture a wide range of global conditions, including diverse climate zones, surface types, and aerosol loadings, from the tropics to high northern latitudes. Most importantly, the central conclusion that the RF-CPE method yields a substantial (44.11%) reduction in retrieval bias compared to conventional methods was proven under the most stringent collocation criteria. This demonstrates a fundamental improvement in the retrieval methodology itself. Therefore, while a denser and more balanced validation network would be required to create a comprehensive global error map, the consistent performance across a variety of representative global conditions, combined with the clear superiority shown in the controlled experiment, provides strong confidence in the general applicability of the proposed RF-CPE framework.

3.4. Regional Spatiotemporal Distribution of XCO₂

While the validation against the TCCON data confirms the pointwise accuracy of the retrievals, it is also crucial to assess whether the globally applied algorithm can produce scientifically coherent spatiotemporal patterns. Therefore, this section describes a case study to evaluate the ability of the retrieval framework to capture well-understood regional carbon cycle dynamics. For this purpose, we analyzed the monthly averaged atmospheric CO₂ distribution over China and neighboring regions from the global TanSat retrievals for 2017–2018 (Figure 12 and Figure 13).
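Monthly averaged maps of this kind are commonly produced by binning individual soundings onto a regular grid. The following sketch assumes a 1° grid over a hypothetical regional box; the function name `gridded_mean`, the grid resolution, the bounds, and the sample values are illustrative and are not taken from the study.

```python
import numpy as np

def gridded_mean(lat, lon, values, lat_edges, lon_edges):
    """Mean of `values` in each lat/lon cell; cells without soundings are NaN."""
    bins = [lat_edges, lon_edges]
    total, _, _ = np.histogram2d(lat, lon, bins=bins, weights=values)
    count, _, _ = np.histogram2d(lat, lon, bins=bins)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(count > 0, total / count, np.nan)

# Hypothetical soundings: latitude, longitude, month, and XCO2 in ppm.
lat = np.array([30.1, 30.2, 40.0])
lon = np.array([110.1, 110.3, 120.0])
month = np.array([1, 1, 2])
xco2 = np.array([400.0, 402.0, 405.0])

lat_edges = np.arange(15.0, 56.0, 1.0)   # assumed regional bounds
lon_edges = np.arange(70.0, 141.0, 1.0)

# One map per month present in the data.
monthly_maps = {
    int(m): gridded_mean(lat[month == m], lon[month == m], xco2[month == m],
                         lat_edges, lon_edges)
    for m in np.unique(month)
}
```

Cells that contain no soundings remain NaN rather than zero, so gaps in satellite coverage are not mistaken for low concentrations.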

The retrieved XCO₂ distributions reveal consistent and scientifically coherent spatiotemporal patterns. A dominant seasonal cycle, consistent with the Northern Hemisphere carbon cycle dynamics, was observed across the region for both years. Concentrations typically build up to a peak in the spring months before a significant drawdown occurs through the summer, which is primarily attributed to vigorous photosynthetic uptake by the terrestrial biosphere [61,62,63]. With the arrival of autumn and winter, concentrations rise again as photosynthesis weakens and anthropogenic emissions become more dominant [64,65].

This seasonal cycle, however, manifests differently across various geographical areas. Within China, a more pronounced seasonal amplitude is evident in the northern regions, where higher vegetation coverage leads to a stronger summer carbon sink effect. In contrast, the more densely populated and industrialized eastern region exhibits less seasonal variability, a pattern likely attributable to the masking effect of strong, persistent anthropogenic emissions [66]. Beyond China’s borders, a particularly notable and recurring feature is apparent in the spring maps (Figure 12 and Figure 13), where a region of elevated XCO₂ concentrations emerges in the boreal forests of Southern Siberia. This early spring emission peak is likely attributable to the distinct phenology of high-latitude ecosystems, where rising temperatures stimulate strong ecosystem respiration from thawed soils before photosynthetic uptake by vegetation has fully commenced [67].

The analysis of the spatiotemporal distribution over China and neighboring regions serves as a final, complementary validation step that, when combined with the rigorous TCCON comparison, provides a comprehensive assessment of the new method’s performance. While the TCCON validation confirms the pointwise numerical accuracy of the retrievals, this regional case study demonstrates the algorithm’s ability to produce scientifically coherent spatiotemporal patterns, such as the seasonal drawdown driven by the biosphere.

A critical consideration for the model’s global applicability, however, is the context of its training data. While the global dataset used for training, consisting of TanSat radiance spectra from 2017–2018 and corresponding atmospheric and surface properties, is extensive, no training dataset can exhaustively cover all possible global scenarios. Consequently, the model’s performance has not been explicitly verified in environments with extreme or unique characteristics that were likely underrepresented in the training data. These potentially challenging environments include the dense tropical forests of the Amazon Basin, the persistent heavy aerosol loading from biomass burning in Central Africa, and the vast, bright desert surfaces of the Sahara. The application of this framework to such “out-of-distribution” scenarios may therefore require further validation or regional fine-tuning to ensure optimal performance, which is an important avenue for future research.

4. Conclusions

This study aimed to address the critical challenge of time delays in a priori profiles, which limits the accuracy of satellite-based XCO₂ retrievals. The primary objective was to develop and validate a machine learning-based approach, the random forest-based CO₂ profile emulator (RF-CPE), capable of generating dynamic, real-time a priori profiles for the full physics retrieval algorithm from TanSat observations.

The developed RF-CPE model was successfully applied to the global TanSat dataset for 2017–2018, and its effectiveness was demonstrated through a comprehensive validation framework. The validation first confirmed the high fidelity of the RF-CPE model itself in reproducing the target Carbon Tracker profiles, achieving an overall R² of 0.71 and an RMSE of 2.13 ppm against the independent test set. Most importantly, the application of these RF-CPE-generated priors resulted in a substantial improvement in retrieval accuracy. A controlled experiment under stringent collocation criteria showed that the mean retrieval bias against the TCCON data was reduced by 44.11%, from 2.63 ppm to 1.47 ppm, compared to using a conventional static prior. Furthermore, the finalized retrieval product demonstrated robust overall performance in a broader validation across the 14 globally distributed TCCON sites, yielding a total R² of 0.76 and an RMSE of 1.99 ppm. As a final confirmation of its validity, the case study of the retrieved spatiotemporal distribution over China and its neighboring regions demonstrated the algorithm’s ability to capture scientifically coherent carbon cycle dynamics, such as the seasonal biospheric drawdown.

In conclusion, this work demonstrates that the RF-CPE provides a robust and computationally efficient solution to the timeliness problem of a priori constraints, enhancing the reliability of satellite CO₂ monitoring. While the study highlights that the retrieval accuracy is still sensitive to site-specific spatial representativeness and the limitations of the current validation network, the significant improvements in both numerical accuracy and scientific coherence provide strong confidence in the method’s general applicability. Future work will focus on developing adaptive collocation criteria and exploring the coordinated retrieval of other atmospheric parameters. Furthermore, on the methodological front, future work will involve investigating and benchmarking the performance of other advanced machine learning techniques, such as gradient boosting and neural networks, to potentially augment the emulator’s precision and overall robustness.

Author Contributions

Conceptualization, Y.W. and L.Z.; methodology, L.Z.; software, H.J.; validation, X.Z. and L.X.; formal analysis, S.W. and L.Z.; investigation, S.W. and L.Z.; resources, S.W. and H.J.; data curation, S.W. and Y.D.; writing—original draft, S.W.; writing—review and editing, Y.W., X.Z., L.X. and Y.D.; visualization, L.Z. and H.J.; supervision, Y.W.; project administration, Y.W.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fujian Provincial Public-Interest Scientific Institution Basal Research Fund (2024R1039).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw satellite observations and reanalysis data used in this study are available from their respective public archives as cited in the manuscript. The derived data products generated during this study, including the trained RF-CPE model and the final XCO₂ retrieval results, are available on reasonable request from the corresponding author. These datasets are not publicly archived at this time due to their large volume and the complex, study-specific nature of the processing chain.

Acknowledgments

We gratefully acknowledge the University of Bremen for kindly providing the SCIATRAN v4.6.1 software. We also wish to express our gratitude to the National Satellite Meteorological Center for supplying the TanSat L1B Nadir-mode scientific data, the TCCON for the XCO₂ measurements, NASA for the MODIS/Aqua data, the NOAA for the Carbon Tracker data, and the ECMWF for the ERA5 meteorological reanalysis. Finally, we extend our heartfelt thanks to the reviewers and editors for their constructive comments and valuable suggestions, which greatly improved this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Tierney, J.E.; Poulsen, C.J.; Montañez, I.P.; Bhattacharya, T.; Feng, R.; Ford, H.L.; Hönisch, B.; Inglis, G.N.; Petersen, S.V.; Sagoo, N.; et al. Past climates inform our future. Science 2020, 370, eaay3701.
Nukusheva, A.; Ilyassova, G.; Rustembekova, D.; Zhamiyeva, R.; Arenova, L. Global warming problem faced by the international community: International legal aspect. Int. Environ. Agreements Politics Law Econ. 2021, 21, 219–233.
World Meteorological Organization. The State of Greenhouse Gases in the Atmosphere Based on Global Observations Through 2023; Technical Report 20; World Meteorological Organization: Geneva, Switzerland, 2024.
Ye, H.; Shi, H.; Li, C.; Wang, X.; Xiong, W.; An, Y.; Wang, Y.; Liu, L. A Coupled BRDF CO₂ Retrieval Method for the GF-5 GMI and Improvements in the Correction of Atmospheric Scattering. Remote Sens. 2022, 14, 488.
Liu, Y.; Wang, J.; Che, K.; Cai, Z.; Yang, D.; Wu, L. Satellite remote sensing of greenhouse gases: Progress and trends. Natl. Remote Sens. Bull. 2021, 25, 53–64.
Kuze, A.; Suto, H.; Nakajima, M.; Hamazaki, T. Thermal and near infrared sensor for carbon observation Fourier-transform spectrometer on the Greenhouse Gases Observing Satellite for greenhouse gases monitoring. Appl. Opt. 2009, 48, 6716–6733.
Nakajima, M.; Suto, H.; Yotsumoto, K.; Shiomi, K.; Hirabayashi, T. Fourier transform spectrometer on GOSAT and GOSAT-2. In Proceedings of the International Conference on Space Optics—ICSO 2014, Tenerife, Spain, 6–10 October 2014; Sodnik, Z., Cugny, B., Karafolas, N., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2017; Volume 10563, p. 105634O.
Suto, H.; Kataoka, F.; Kikuchi, N.; Knuteson, R.O.; Butz, A.; Haun, M.; Buijs, H.; Shiomi, K.; Imai, H.; Kuze, A. Thermal and near-infrared sensor for carbon observation Fourier transform spectrometer-2 (TANSO-FTS-2) on the Greenhouse gases Observing SATellite-2 (GOSAT-2) during its first year in orbit. Atmos. Meas. Tech. 2021, 14, 2013–2039.
Imasu, R.; Matsunaga, T.; Nakajima, M.; Yoshida, Y.; Shiomi, K.; Morino, I.; Saitoh, N.; Niwa, Y.; Someya, Y.; Oishi, Y.; et al. Greenhouse gases Observing SATellite 2 (GOSAT-2): Mission overview. Prog. Earth Planet. Sci. 2023, 10, 33.
Crisp, D. Measuring atmospheric carbon dioxide from space with the Orbiting Carbon Observatory-2 (OCO-2). In Proceedings of the Earth Observing Systems XX, San Diego, CA, USA, 9–13 August 2015; Butler, J.J., Xiong, X.J., Gu, X., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2015; Volume 9607, p. 960702.
Taylor, T.E.; Eldering, A.; Merrelli, A.; Kiel, M.; Somkuti, P.; Cheng, C.; Rosenberg, R.; Fisher, B.; Crisp, D.; Basilio, R.; et al. OCO-3 early mission operations and initial (vEarly) X_CO₂ and SIF retrievals. Remote Sens. Environ. 2020, 251, 112032.
Liu, Y.; Wang, J.; Yao, L.; Chen, X.; Cai, Z.; Yang, D.; Yin, Z.; Gu, S.; Tian, L.; Lu, N.; et al. The TanSat mission: Preliminary global observations. Sci. Bull. 2018, 63, 1200–1207.
Li, Z.; Xie, Y.; Shi, Y.; Li, Q.; Cohen, J.; Zhang, Y.; Han, Y.; Xiong, W.; Liu, Y. A review of collaborative remote sensing observation of greenhouse gases and aerosol with atmospheric environment satellites. Natl. Remote Sens. Bull. 2022, 26, 795–816.
Chen, L.; Zhang, Y.; Zou, M.; Xu, Q.; Li, L.; Li, X.; Tao, J. Overview of atmospheric CO₂ remote sensing from space. J. Remote Sens. 2015, 19, 1–11.
Crevoisier, C.; Chédin, A.; Matsueda, H.; Machida, T.; Armante, R.; Scott, N.A. First year of upper tropospheric integrated content of CO₂ from IASI hyperspectral infrared observations. Atmos. Chem. Phys. 2009, 9, 4797–4810.
Miao, Y.; Zou, M.; Sheng, S.; Zhu, K.; Ding, W.; Lin, J.; Qu, Z.; Li, D. CO₂ satellite inversion method based on machine learning. China Environ. Sci. 2023, 43, 20–27.
Chen, W.; Ren, T.; Zhao, C.; Wen, Y.; Gu, Y.; Zhou, M.; Wang, P. Transformer-Based Fast Mole Fraction of CO₂ Retrievals from Satellite-Measured Spectra. J. Remote Sens. 2025, 5, 0470.
Wu, L.; Hasekamp, O.; Hu, H.; Landgraf, J.; Butz, A.; aan de Brugh, J.; Aben, I.; Pollard, D.F.; Griffith, D.W.T.; Feist, D.G.; et al. Carbon dioxide retrieval from OCO-2 satellite observations using the RemoTeC algorithm and validation with TCCON measurements. Atmos. Meas. Tech. 2018, 11, 3111–3130.
Yang, D.; Zhang, H.; Liu, Y.; Chen, B.; Cai, Z.; Lü, D. Monitoring carbon dioxide from space: Retrieval algorithm and flux inversion based on GOSAT data and using CarbonTracker-China. Adv. Atmos. Sci. 2017, 34, 965–976.
Liu, Y.; Yao, L.; Wang, J.; Yang, D.; Cai, Z.; Lu, N.; Lyu, D. Application Status of carbon satellite data in China. Satell. Appl. 2022, 46–50.
Yang, D.; Boesch, H.; Liu, Y.; Somkuti, P.; Cai, Z.; Chen, X.; Di Noia, A.; Lin, C.; Lu, N.; Lyu, D.; et al. Toward High Precision XCO₂ Retrievals from TanSat Observations: Retrieval Improvement and Validation Against TCCON Measurements. J. Geophys. Res. Atmos. 2020, 125, e2020JD032794.
Buchwitz, M.; Rozanov, V.V.; Burrows, J.P. A near-infrared optimized DOAS method for the fast global retrieval of atmospheric CH₄, CO, CO₂, H₂O, and N₂O total column amounts from SCIAMACHY Envisat-1 nadir radiances. J. Geophys. Res. Atmos. 2000, 105, 15231–15245.
Schneising, O.; Buchwitz, M.; Reuter, M.; Heymann, J.; Bovensmann, H.; Burrows, J.P. Long-term analysis of carbon dioxide and methane column-averaged mole fractions retrieved from SCIAMACHY. Atmos. Chem. Phys. 2011, 11, 2863–2880.
Bovensmann, H.; Buchwitz, M.; Burrows, J.P.; Reuter, M.; Krings, T.; Gerilowski, K.; Schneising, O.; Heymann, J.; Tretner, A.; Erzinger, J. A remote sensing technique for global monitoring of power plant CO₂ emissions from space and related applications. Atmos. Meas. Tech. 2010, 3, 781–811.
Yoshida, Y.; Ota, Y.; Eguchi, N.; Kikuchi, N.; Nobuta, K.; Tran, H.; Morino, I.; Yokota, T. Retrieval algorithm for CO₂ and CH₄ column abundances from short-wavelength infrared spectral observations by the Greenhouse gases observing satellite. Atmos. Meas. Tech. 2011, 4, 717–734.
Oshchepkov, S.; Bril, A.; Yokota, T.; Morino, I.; Yoshida, Y.; Matsunaga, T.; Belikov, D.; Wunch, D.; Wennberg, P.; Toon, G.; et al. Effects of atmospheric light scattering on spectroscopic observations of greenhouse gases from space: Validation of PPDF-based CO₂ retrievals from GOSAT. J. Geophys. Res. Atmos. 2012, 117, D12305.
Frankenberg, C.; O’Dell, C.; Guanter, L.; McDuffie, J. Remote sensing of near-infrared chlorophyll fluorescence from space in scattering atmospheres: Implications for its retrieval and interferences with atmospheric CO₂ retrievals. Atmos. Meas. Tech. 2012, 5, 2081–2094.
Bösch, H.; Toon, G.C.; Sen, B.; Washenfelder, R.A.; Wennberg, P.O.; Buchwitz, M.; de Beek, R.; Burrows, J.P.; Crisp, D.; Christi, M.; et al. Space-based near-infrared CO₂ measurements: Testing the Orbiting Carbon Observatory retrieval algorithm and validation concept using SCIAMACHY observations over Park Falls, Wisconsin. J. Geophys. Res. Atmos. 2006, 111, D23302.
Basu, S.; Krol, M.; Butz, A.; Clerbaux, C.; Sawa, Y.; Machida, T.; Matsueda, H.; Frankenberg, C.; Hasekamp, O.P.; Aben, I. The seasonal variation of the CO₂ flux over Tropical Asia estimated from GOSAT, CONTRAIL, and IASI. Geophys. Res. Lett. 2014, 41, 1809–1815.
Liu, Y.; Yang, D.; Cai, Z. A retrieval algorithm for TanSat XCO₂ observation: Retrieval experiments using GOSAT data. Chin. Sci. Bull. 2013, 58, 1520–1523.
Yang, D.; Liu, Y.; Cai, Z.; Deng, J.; Wang, J.; Chen, X. An advanced carbon dioxide retrieval algorithm for satellite measurements and its application to GOSAT observations. Sci. Bull. 2015, 60, 2063–2066.
Yang, D.; Liu, Y.; Cai, Z.; Chen, X.; Yao, L.; Lu, D. First Global Carbon Dioxide Maps Produced from TanSat Measurements. Adv. Atmos. Sci. 2018, 35, 621–623.
Hersbach, H.; Bell, B.; Berrisford, P.; Hirahara, S.; Horányi, A.; Muñoz-Sabater, J.; Nicolas, J.; Peubey, C.; Radu, R.; Schepers, D.; et al. The ERA5 global reanalysis. Q. J. R. Meteorol. Soc. 2020, 146, 1999–2049.
Jiang, Q.; Li, W.; Fan, Z.; He, X.; Sun, W.; Chen, S.; Wen, J.; Gao, J.; Wang, J. Evaluation of the ERA5 reanalysis precipitation dataset over Chinese Mainland. J. Hydrol. 2021, 595, 125660.
Ma, J.; Zhu, Y.; Wang, P.; Duan, M. A Review on the developments of NCEP, ECMWF and CMC global ensemble forecast system. Trans. Atmos. Sci. 2011, 34, 370.
Engel-Cox, J.A.; Holloman, C.H.; Coutant, B.W.; Hoff, R.M. Qualitative and quantitative evaluation of MODIS satellite sensor data for regional and urban scale air quality. Atmos. Environ. 2004, 38, 2495–2509.
Kaufman, Y.J.; Tanré, D.; Remer, L.A.; Vermote, E.F.; Chu, A.; Holben, B.N. Operational remote sensing of tropospheric aerosol over land from EOS moderate resolution imaging spectroradiometer. J. Geophys. Res. Atmos. 1997, 102, 17051–17067.
Kaufman, Y.J.; Tanré, D. Strategy for direct and indirect methods for correcting the aerosol effect on remote sensing: From AVHRR to EOS-MODIS. Remote Sens. Environ. 1996, 55, 65–79.
Hyer, E.J.; Reid, J.S.; Zhang, J. An over-land aerosol optical depth data set for data assimilation by filtering, correction, and aggregation of MODIS Collection 5 optical depth retrievals. Atmos. Meas. Tech. 2011, 4, 379–408.
Levy, R.C.; Mattoo, S.; Munchak, L.A.; Remer, L.A.; Sayer, A.M.; Patadia, F.; Hsu, N.C. The Collection 6 MODIS aerosol products over land and ocean. Atmos. Meas. Tech. 2013, 6, 2989–3034.
Hu, K.; Feng, X.; Zhang, Q.; Shao, P.; Liu, Z.; Xu, Y.; Wang, S.; Wang, Y.; Wang, H.; Di, L.; et al. Review of Satellite Remote Sensing of Carbon Dioxide Inversion and Assimilation. Remote Sens. 2024, 16, 3394.
Mai, B.; Deng, X.; An, X.; Zhou, L.; Tan, H.; Li, F.; Li, N. Simulation of typical surface CO₂ cases over Guangdong region base on Carbon Tracker numerical model. Acta Sci. Circumstantiae 2014, 34, 1833–1844.
Gao, H.; Gu, X.; Zhou, X.; Yu, T.; Wang, Y. Analysis of the development trend of Chinese remote sensing validation sites and infrastructure construction. Natl. Remote Sens. Bull. 2023, 27, 1088–1098.
Liang, A.; Gong, W.; Han, G.; Xiang, C. Comparison of Satellite-Observed XCO₂ from GOSAT, OCO-2, and Ground-Based TCCON. Remote Sens. 2017, 9, 1033.
Zhou, M.; Shu, J.; Song, C.; Gao, W. Sensitivity studies for atmospheric carbon dioxide retrieval from atmospheric infrared sounder observations. J. Appl. Remote Sens. 2014, 8, 083697.
Rong, P.; Zhang, C.; Liu, D.; Zhang, L.; Zhang, X.; Zhang, P.; Huyan, Z. Sensitivity analysis of an XCO₂ retrieval algorithm for high-resolution short-wave infrared spectra. Optik 2020, 209, 164502.
Crisp, D.; Atlas, R.M.; Breon, F.M.; Brown, L.R.; Burrows, J.P.; Ciais, P.; Connor, B.J.; Doney, S.C.; Fung, I.Y.; Jacob, D.J.; et al. The Orbiting Carbon Observatory (OCO) mission. Adv. Space Res. 2004, 34, 700–709.
Nelson, R.R.; Kulawik, S.S.; O’Dell, C.W.; McDuffie, J.; Eldering, A. Improving OCO-2 X_CO2 Retrievals Through the Scaling of Singular Value Decomposition-Based Temperature and Water Vapor Profiles. Earth Space Sci. 2025, 12, e2024EA003975.
Rozanov, V.; Buchwitz, M.; Eichmann, K.U.; de Beek, R.; Burrows, J. Sciatran—A new radiative transfer model for geophysical applications in the 240–2400 NM spectral region: The pseudo-spherical version. Adv. Space Res. 2002, 29, 1831–1835. [Google Scholar] [CrossRef]
Butz, A.; Hasekamp, O.P.; Frankenberg, C.; Aben, I. Retrievals of atmospheric CO₂ from simulated space-borne measurements of backscattered near-infrared sunlight: Accounting for aerosol effects. Appl. Opt. 2009, 48, 3322–3336. [Google Scholar] [CrossRef]
Nicodemus, F.E.; Richmond, J.C.; Hsia, J.J.; Ginsberg, I.W.; Limperis, T. Geometrical Considerations and Nomenclature for Reflectance; Final Report National Bureau of Standards; Institute for Basic Standards: Washington, DC, USA, 1977.
Goody, R.M.; Yung, Y.L. Atmospheric Radiation: Theoretical Basis; Oxford University Press: Oxford, UK, 1989. [Google Scholar] [CrossRef]
Mao, J.; Kawa, S.R. Sensitivity studies for space-based measurement of atmospheric total column carbon dioxide by reflected sunlight. Appl. Opt. 2004, 43, 914–927. [Google Scholar] [CrossRef]
Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
Rodgers, C.D. Inverse Methods for Atmospheric Sounding: Theory and Practice; World Scientific: Singapore, 2000; Volume 2.
Ye, H.; Wang, X.; Wu, J.; Fang, Y. Error matrix construction method for atmospheric carbon dioxide Bayesian retrieval. Infrared Laser Eng. 2014, 43, 249–253.
Li, R.; Zhou, X.; Cheng, T.; Tao, Z.; Gu, X.; Wang, N.; Zhang, H.; Lv, T. The Influence of Validation Colocation on XCO₂ Satellite–Terrestrial Joint Observations. Remote Sens. 2023, 15, 5270.
Friedl, M.; Sulla-Menashe, D. MCD12Q1 MODIS/Terra+Aqua Land Cover Type Yearly L3 Global 500m SIN Grid V006 [Data Set]. 2019. Available online: https://www.earthdata.nasa.gov/data/catalog/lpcloud-mcd12q1-006 (accessed on 8 June 2025).
Wunch, D.; Wennberg, P.O.; Osterman, G.; Fisher, B.; Naylor, B.; Roehl, C.M.; O’Dell, C.; Mandrake, L.; Viatte, C.; Kiel, M.; et al. Comparisons of the Orbiting Carbon Observatory-2 (OCO-2) XCO₂ measurements with TCCON. Atmos. Meas. Tech. 2017, 10, 2209–2238.
Sha, M.K.; Langerock, B.; Blavier, J.F.L.; Blumenstock, T.; Borsdorff, T.; Buschmann, M.; Dehn, A.; De Mazière, M.; Deutscher, N.M.; Feist, D.G.; et al. Validation of methane and carbon monoxide from Sentinel-5 Precursor using TCCON and NDACC-IRWG stations. Atmos. Meas. Tech. 2021, 14, 6249–6304.
Morais Filho, L.F.F.; de Meneses, K.C.; de Araújo Santos, G.A.; da Silva Bicalho, E.; de Souza Rolim, G.; La Scala Jr, N. XCO₂ temporal variability above Brazilian agroecosystems: A remote sensing approach. J. Environ. Manag. 2021, 288, 112433.
Zhao, H.; Fan, J.; Gu, B.; Chen, Y. Carbon sink response of terrestrial vegetation ecosystems in the Yangtze River Delta and its driving mechanism. J. Geogr. Sci. 2024, 34, 112–130.
Chang, Z.; Fan, L.; Wigneron, J.P.; Wang, Y.P.; Ciais, P.; Chave, J.; Fensholt, R.; Chen, J.M.; Yuan, W.; Ju, W.; et al. Estimating Aboveground Carbon Dynamic of China Using Optical and Microwave Remote-Sensing Datasets from 2013 to 2019. J. Remote Sens. 2023, 3, 0005.
Ripple, W.J.; Wolf, C.; Newsome, T.M.; Barnard, P.; Moomaw, W.R. Corrigendum: World Scientists’ Warning of a Climate Emergency. BioScience 2019, 70, 100.
Benton-Short, L.; Short, J.R. Cities and Nature; Routledge: New York, NY, USA, 2013.
Li, Y.; Yan, J.; Zhong, L.; Bao, D.; Sun, L.; Li, G. Full-Coverage Mapping of Daily High-Resolution XCO₂ across China from 2015 to 2020 by Deep Learning-Based Spatio-Temporal Fusion. IEEE Trans. Geosci. Remote Sens. 2025. early access.
Watts, J.D.; Farina, M.; Kimball, J.S.; Schiferl, L.D.; Liu, Z.; Arndt, K.A.; Zona, D.; Ballantyne, A.; Euskirchen, E.S.; Parmentier, F.J.W.; et al. Carbon uptake in Eurasian boreal forests dominates the high-latitude net ecosystem carbon budget. Glob. Change Biol. 2023, 29, 1870–1889.

Figure 1. Simulated spectra of the weak CO₂ band with SCIATRAN under different AOD scenarios.

Figure 2. Simulated spectra of the weak CO₂ band with SCIATRAN under different reflectance (R) scenarios.

Figure 3. Simulated spectra of the weak CO₂ band with SCIATRAN under different temperature perturbations. (a) Simulated spectral radiance. (b) Radiance change rate.

Figure 4. Simulated spectra of the weak CO₂ band with SCIATRAN under different water vapor perturbations. (a) Simulated spectral radiance. (b) Radiance change rate.

Figure 5. The ranking of feature importance scores.

Figure 6. The overall workflow of this study, illustrating the integration of a statistical model (right panel) with a physical retrieval framework (left panel). The statistical model is trained to generate real-time CO₂ profiles, which are then used as dynamic a priori constraints in the physical retrieval process.

Figure 7. Layer-by-layer evaluation results of the RF-CPE model. (a) RMSE. (b) Std.

Figure 8. Validation results for CO₂ concentrations in the four near-surface atmospheric layers. (a) First layer. (b) Second layer. (c) Third layer. (d) Fourth layer.

Figure 9. Comparison of CO₂ retrieval accuracy using static and generated prior profiles.

Figure 10. Comparison of retrieved XCO₂ with XCO₂ from TCCON sites.

Figure 11. Land cover composition within a ±5° spatial window centered on the Caltech site.

Figure 12. Monthly average atmospheric CO₂ concentrations in China and surrounding areas in 2017 (data missing for January and February).

Figure 13. Same as Figure 12, but in 2018.

Table 1. Main retrieval algorithms and their properties.

Algorithm Name	Basic Principles	Features and Advantages
WFM-DOAS [22]	Unconstrained linear least squares. Fits the normalized solar radiation by scaling or shifting the preselected vertical profile. Introduces a weighting function to describe CO₂ absorption characteristics.	Subsequent improvements (introduction of M factor and cloud detection) effectively compensated for instrument drift and environmental influences. Suitable for SCIAMACHY data retrieval.
BESD [24]	Combines WFM-DOAS with optimal estimation methods. Introduces prior information to constrain the retrieval and optimize state vectors.	Suitable for complex observation geometries and atmospheric conditions. Subsequent versions (such as BESD/C [23]) extended the error parameterization and improved the computational efficiency.
NIES [25]	Developed for GOSAT’s TANSO-FTS data based on an optimal estimation method. Built-in cloud detection and aerosol processing modules.	Optimized for GOSAT data characteristics. Filters out abnormal observation spectra contaminated by factors such as clouds to improve the reliability of retrieval data.
PPDF [26]	Parameterization method using the probability density function of the photon path length. Simulates the transmission process of photons in the atmosphere and calculates the scattering effects of clouds and aerosols.	Accurately accounts for errors under atmospheric scattering and thin cloud conditions. Applicable to situations with complex atmospheric structures or significant scattering.
ACOS [27]	Based on the optimal estimation method, the simulated spectrum and the observed spectrum are optimally matched by adjusting the forward model parameters. Applicable to OCO series and GOSAT data.	Strong parameter optimization capabilities. Stable performance in global retrieval applications after multiple verifications. Effective processing of narrowband observation information.
UoL-FP [28]	Uses a complete physical model (including radiative transfer, solar spectrum, and instrument response) combined with optimal estimation to achieve retrieval. Comprehensive processing for multiband and multiangle observations.	High physical consistency and a more detailed description of atmospheric states. Adapts to a wide range of observation conditions.
RemoTeC [29]	Simultaneously retrieves XCO₂ and key aerosol parameters. Parameterizes the number of particles, vertical distribution, and microphysical properties to compensate for errors caused by atmospheric scattering.	Synchronous retrieval of gas and aerosol to reduce errors caused by scattering effects. Excellent performance under complex cloud and aerosol conditions.
IAPCAS [30,31]	Iterative optimal estimation of XCO₂, water vapor/temperature profiles, surface albedo, and wavenumber drift to fit spectra and retrieve XCO₂. Pre-computed CO₂/O₂/H₂O absorption tables replacing LBLRTM, integrated with VLIDORT for multilayer gas and cloud scattering.	Synchronous retrieval of aerosol optical thickness, cloud profile, and effective radius to reduce thin cloud/aerosol errors. Multidimensional lookup and O₂-A band rapid cloud screening, enabling global processing in complex scattering scenes.

Table 2. Basic information for TCCON validation sites.

Index	Site Name	Longitude	Latitude	Country	Continent
1	Burgos	120.65°E	18.53°N	Philippines	Asia
2	Hefei	117.17°E	31.90°N	China	Asia
3	Saga	130.29°E	33.24°N	Japan	Asia
4	Tsukuba	140.12°E	36.05°N	Japan	Asia
5	Caltech	118.13°W	34.14°N	USA	North America
6	East Trout Lake	104.99°W	54.36°N	Canada	North America
7	Lamont	97.49°W	36.60°N	USA	North America
8	Park Falls	90.27°W	45.94°N	USA	North America
9	Garmisch	11.06°E	47.48°N	Germany	Europe
10	Karlsruhe	8.44°E	49.10°N	Germany	Europe
11	Orléans	2.11°E	47.96°N	France	Europe
12	Paris	2.36°E	48.85°N	France	Europe
13	Darwin	130.89°E	12.42°S	Australia	Oceania
14	Wollongong	150.88°E	34.41°S	Australia	Oceania
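
As an illustration of how satellite soundings might be matched to a site in Table 2, the following is a minimal sketch that keeps soundings inside a latitude/longitude box around a site. The ±5° half-width is borrowed from the window in the Figure 11 caption and is an assumption here, not a confirmed colocation criterion of this study; the names `Site`, `within_window`, and `colocate`, and the sounding values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    lon: float  # degrees east (west is negative)
    lat: float  # degrees north (south is negative)


def lon_diff(a: float, b: float) -> float:
    """Smallest absolute longitude difference in degrees, handling wrap-around."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def within_window(site: Site, lon: float, lat: float, half_width: float = 5.0) -> bool:
    """True if (lon, lat) lies inside the box of +/- half_width degrees around the site."""
    return abs(lat - site.lat) <= half_width and lon_diff(lon, site.lon) <= half_width


def colocate(site: Site, soundings, half_width: float = 5.0):
    """Keep the (lon, lat, xco2) soundings that fall inside the window (assumed criterion)."""
    return [s for s in soundings if within_window(site, s[0], s[1], half_width)]


# Caltech coordinates from Table 2; the soundings are made-up illustrative values.
caltech = Site("Caltech", lon=-118.13, lat=34.14)
soundings = [
    (-120.0, 36.0, 406.1),  # inside the window
    (-110.0, 34.0, 405.2),  # longitude too far east
    (-117.5, 30.0, 404.8),  # inside the window
]
matched = colocate(caltech, soundings)
print(len(matched), "soundings matched to", caltech.name)
```

A real colocation would typically also impose a time window around each ground-based measurement, which this sketch omits.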

Table 3. High-dimensional feature space.

Index	Feature	Data Source
1	Weak CO₂ (wCO₂) absorption band radiance	TanSat Nadir Observations
2	Surface pressure	ERA5/ECMWF
3	Temperature	ERA5/ECMWF
4	Wind-u	ERA5/ECMWF
5	Wind-v	ERA5/ECMWF
6	Total column water	ERA5/ECMWF
7	Humidity	ERA5/ECMWF
8	AOD	MODIS/MYD04_L2
9	NDVI	MODIS/MYD13A3
10	Reflectance	MODIS/MYD09GA

Table 4. Spatial matching results for ground-based sites.

Region	Data Volume	R²	RMSE (ppm)	Std (ppm)
Asia	109	0.68	1.85	1.73
North America	343	0.84	2.26	1.69
Europe	200	0.67	2.06	1.54
Oceania	109	0.81	0.77	0.73
Total	761	0.76	1.99	1.67
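
The statistics reported in Table 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: R² is taken as the squared Pearson correlation, RMSE as the root mean square of the retrieved-minus-reference differences, and Std as the population standard deviation of those differences. The function name `validation_metrics` and the sample values are hypothetical.

```python
import math
from statistics import fmean, pstdev


def validation_metrics(retrieved, reference):
    """Return bias, RMSE, Std, and R^2 for paired XCO2 values (ppm)."""
    if len(retrieved) != len(reference) or len(retrieved) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [r - t for r, t in zip(retrieved, reference)]
    bias = fmean(diffs)                               # mean difference
    rmse = math.sqrt(fmean([d * d for d in diffs]))   # root mean square difference
    std = pstdev(diffs)                               # spread of the differences
    # Squared Pearson correlation between the two series (assumed definition of R^2).
    mr, mt = fmean(retrieved), fmean(reference)
    cov = sum((r - mr) * (t - mt) for r, t in zip(retrieved, reference))
    var_r = sum((r - mr) ** 2 for r in retrieved)
    var_t = sum((t - mt) ** 2 for t in reference)
    r2 = cov * cov / (var_r * var_t)
    return {"bias": bias, "rmse": rmse, "std": std, "r2": r2}


# Made-up paired values (ppm), for illustration only.
retrieved = [405.2, 406.8, 404.1, 407.5, 403.9]
reference = [404.0, 405.9, 404.5, 406.1, 403.0]
m = validation_metrics(retrieved, reference)
print({k: round(v, 3) for k, v in m.items()})
```

With these definitions, RMSE² equals bias² plus Std², so RMSE is never smaller than Std. Every row of Table 4 is consistent with that ordering.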

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
