1. Introduction
Carbon dioxide (CO2) is the predominant anthropogenic greenhouse gas, accounting for approximately 74% of total anthropogenic greenhouse gas emissions when expressed in CO2-equivalent terms, and it plays a pivotal role in driving climate change [1,2]. This warming trend manifests in a cascade of interconnected environmental shifts, including rising global mean surface temperatures, accelerated melting of glaciers and ice sheets, and an alarming increase in the frequency and intensity of extreme weather phenomena such as heatwaves and heavy precipitation. To effectively mitigate these impacts and track the global carbon budget, it is imperative to have highly accurate measurements of the atmospheric column-averaged dry-air mole fraction of CO2 (XCO2). In this context, ground-based remote sensing techniques, particularly Fourier transform infrared (FTIR) spectroscopy, have proven indispensable for validating satellite observations and maintaining long-term atmospheric records [3].
Complementing these established methodologies, recent advancements in miniaturized, high-sensitivity monitoring technologies have gained prominence for localized, in-situ greenhouse gas measurements. Techniques such as photoacoustic spectroscopy (PAS) and light-induced thermoelastic spectroscopy (LITES) offer practical advantages like simpler instrumentation and excellent immunity to vibrations [4,5]. However, these methods typically probe only a single absorption line and exhibit lower spectral resolution compared to techniques designed for open-path column measurements. This limitation is particularly critical for retrieving total column abundances, as the precise shape of an absorption line—which is broadened by pressure at different altitudes—contains vital information about the vertical distribution of the gas. Without the ability to resolve these subtle line shape features, accurately separating the contributions from different atmospheric layers becomes challenging. It is in this context that laser heterodyne radiometry (LHR) stands out.
As a high-resolution, passive remote sensing technique, LHR has emerged as a highly effective tool for atmospheric trace gas measurements, owing to its exceptional spectral resolution and quantum-limited sensitivity, which are ideal for resolving pressure-broadened line shapes from ground to space [6,7,8]. The development of the laser heterodyne radiometer represents a significant milestone in atmospheric science. Since its initial demonstration by Menzies and Shumate in the 1970s [9,10], substantial progress has been achieved in both instrument design and retrieval methodologies. For instance, Tsai et al. introduced proper instrument line shape functions and rigorous error analysis frameworks [11], while Wilson et al. developed a compact LHR system suitable for field deployment [12]. More recently, Wang et al. demonstrated the simultaneous measurement of CO2 and CH4 using distributed feedback lasers, further advancing the practical applications of LHR systems [13]. However, the ultimate accuracy of any ground-based retrieval is fundamentally tied not only to the instrument's quality but also to the fidelity of the atmospheric state information used in the inversion algorithm.
Despite these advancements, retrieving atmospheric parameters from LHR data remains a challenging inverse problem due to the limited number of observational degrees of freedom relative to the number of unknown variables [14]. A priori information is therefore essential to constrain solutions and ensure physical consistency [15]. Studies on similar ground-based spectrometers have shown that inaccurate a priori profiles can introduce biases of up to 2–3 parts per million (ppm) in column-averaged dry-air mole fractions [16]. Traditional approaches, which rely on historical or monthly averaged profiles, often fail to capture real-time regional variations, leading to discrepancies between retrieved results and actual atmospheric conditions. Consequently, there is an urgent need for adaptive profile generation methods that can account for real-time atmospheric variability.
Machine learning algorithms offer promising solutions by enabling the real-time generation of atmospheric profiles [17]. Recent studies have demonstrated their potential to overcome the limitations of conventional methods. For example, Smith and Barnet improved temperature retrievals using neural networks [18], while Fujita et al. characterized atmospheric structures with advanced predictive models [19]. Nevertheless, sophisticated data processing techniques remain indispensable for achieving precise CO2 retrievals in LHR systems.
To address these challenges, we propose a machine learning-based framework for real-time atmospheric profile generation and enhanced CO2 retrieval accuracy. Our approach integrates locally weighted scatterplot smoothing (LOWESS) baseline correction within an optimal estimation algorithm implemented using Python for Computational Atmospheric Spectroscopy (Py4CAtS) [20]. Validation experiments conducted in Hefei (31.9°N, 117.16°E) on 22 August 2025 achieved stable retrievals with an uncertainty of 0.16% for XCO2 [21]. Detailed methodologies, data processing steps, and results are presented in the subsequent sections.
2. Methodology for Generating Prior Atmospheric Profiles
The proposed prior profile generation model offers an efficient and accurate means of producing the atmospheric profiles required by the retrieval process. By leveraging real-time meteorological parameters as inputs, the trained model outputs profiles of temperature, pressure, and six primary atmospheric gases (Figure 1). The dataset used for model development was derived from published FTIR observations and synchronized prior profiles obtained at the TCCON station in Hefei. The TCCON network is globally recognized as the gold standard for ground-based column measurements of greenhouse gases, providing a high-quality, rigorously validated, and consistent dataset that is ideal for training robust machine learning models [22,23].
Prior to model training, the dataset underwent several quality-control procedures. Gaps in the data record were imputed using median filling, a robust approach to handling missing values that is well suited to the skewed distributions and outliers common in atmospheric data. After imputation, the profiles were screened for physical plausibility, which ensured data quality and provided a reliable basis for the subsequent principal component analysis and random forest modeling.
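As an illustration of this preprocessing step (a minimal sketch, not the exact implementation; the column names and plausibility limits are assumptions), median imputation and a simple physical screen can be written with pandas as follows:

```python
import pandas as pd

def preprocess_profiles(df: pd.DataFrame) -> pd.DataFrame:
    """Median-fill missing values, then screen out physically implausible rows."""
    # Median imputation is robust to the skewed distributions and outliers
    # that occur in atmospheric profile data.
    filled = df.fillna(df.median(numeric_only=True))

    # Simple physical-plausibility screen (column names and limits are illustrative).
    mask = (
        filled["temperature_K"].between(150.0, 350.0)
        & (filled["pressure_hPa"] > 0.0)
        & (filled.filter(like="vmr_") >= 0.0).all(axis=1)
    )
    return filled[mask].reset_index(drop=True)
```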
Principal component analysis (PCA) and random forest training were implemented in the model. PCA, a classical multivariate statistical method for dimensionality reduction, transforms high-dimensional, strongly correlated vertical profile data into orthogonal, uncorrelated principal components [24]. To determine the optimal number of principal components, the cumulative variance ratio threshold method was applied. This is mathematically expressed as:

$$\eta(k) = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{M} \lambda_i}$$

where $\eta(k)$ represents the cumulative variance ratio, $\lambda_i$ denotes the eigenvalue corresponding to the $i$-th principal component, $k$ indicates the number of selected components, and $M$ is the total number of components.
For each atmospheric profile, the minimum number of principal components necessary to achieve the preset cumulative explained variance ratio threshold was determined (Figure 2). The figure illustrates the relationship between the cumulative explained variance and the number of principal components for each profile. A red horizontal line denotes the preset variance threshold, while a green vertical line indicates the number of components required to meet this threshold. The blue curve represents the cumulative explained variance, demonstrating how additional components incrementally capture the total variance. This visualization facilitates the identification of the optimal number of principal components, balancing dimensionality reduction with information retention. Subsequently, PCA was applied to the preprocessed atmospheric profiles to reduce their dimensionality, yielding low-dimensional, uncorrelated inputs for the subsequent random forest modeling.
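For concreteness, the component-selection step can be sketched with scikit-learn as follows; the 99% threshold and the array layout (samples × vertical levels) are illustrative assumptions rather than the values used in this work:

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_pca_with_threshold(X, threshold=0.99):
    """Keep the minimum number of components whose cumulative explained-variance
    ratio reaches the preset threshold (threshold value is illustrative)."""
    pca_full = PCA().fit(X)                          # X: (n_samples, n_levels) profile matrix
    cumvar = np.cumsum(pca_full.explained_variance_ratio_)
    k = int(np.searchsorted(cumvar, threshold)) + 1  # smallest k with cumvar[k-1] >= threshold
    pca = PCA(n_components=k).fit(X)
    return pca, pca.transform(X)                     # low-dimensional scores for the random forest
```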
The Random Forest algorithm was specifically chosen for its robustness to overfitting [25], its ability to handle non-linear relationships effectively, and its relatively straightforward interpretability compared to more 'black-box' models such as deep neural networks. During training, the algorithm generates an ensemble of regression trees with diverse structures through random sampling of instances and features [26]. For a new input sample, predictions from all trees are averaged to compute the final output, as shown below:

$$\hat{y}(x) = \frac{1}{B}\sum_{b=1}^{B} f_b(x)$$

Here, $\hat{y}(x)$ denotes the predicted value, $B$ represents the total number of regression trees, and $f_b(x)$ corresponds to the prediction of the $b$-th tree for input feature $x$. To enhance robustness and mitigate overfitting, cross-validation and hyperparameter tuning were performed (Figure 3). RandomizedSearchCV was utilized to explore the parameter space systematically after defining hyperparameter distributions. Notable variations in mean squared error (MSE) were observed across different hyperparameter configurations, influencing the training performance of the random forest model. As depicted in Figure 3, the MSE distributions for the various atmospheric profiles differ depending on the selected hyperparameters. To mitigate this issue, we performed individualized hyperparameter optimization for each profile to determine the most effective parameter settings. The optimal hyperparameter combinations are comprehensively listed in Appendix A.
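A minimal sketch of this tuning step with scikit-learn is given below; the search space, fold count, and scoring metric are illustrative assumptions, not the settings reported in Appendix A:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

def tune_random_forest(X_train, y_train):
    """Randomized hyperparameter search for one profile's regression model."""
    param_distributions = {                  # illustrative search space
        "n_estimators": randint(100, 500),
        "max_depth": randint(5, 30),
        "min_samples_leaf": randint(1, 10),
        "max_features": ["sqrt", "log2", None],
    }
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=0),
        param_distributions=param_distributions,
        n_iter=50,                           # number of sampled configurations
        cv=5,                                # 5-fold cross-validation
        scoring="neg_mean_squared_error",    # MSE-based model selection
        random_state=0,
    )
    search.fit(X_train, y_train)
    return search.best_estimator_, -search.best_score_  # tuned model and its CV MSE
```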
As illustrated in Figure 4, key observations include the following: temperature, pressure, H2O, and CO2 exhibit tight clustering around the diagonal line, suggesting excellent model performance for these variables. In contrast, CH4 and N2O show slightly wider dispersion, indicative of moderate prediction uncertainty. Meanwhile, O2 and CO show consistent patterns with minimal outliers, confirming their reliable predictions. These overall distribution characteristics validate that the random forest model effectively captures the complex nonlinear relationships between the input features and the principal components. This capability facilitates accurate reconstruction of atmospheric state variables from real-time meteorological inputs, underscoring the model's robustness and precision.
The performance metrics in Table 1 demonstrate that the model achieves accurate fitting while maintaining strong generalization across both the training and test sets. In the context of atmospheric profile modeling, the low MSE values indicate that the average prediction error is small relative to the natural variability of the profiles, ensuring that the generated a priori information is a reliable starting point for the inversion. Concurrently, the high coefficient of determination (R²) values, ranging from 0.88 to 0.97, are particularly significant [27]. An R² value exceeding 0.9 is widely considered to indicate a model that effectively explains the variance of the reference data. Notably, the model demonstrates strong generalization for our primary target, CO2, achieving an R² of 0.94 on the training set and maintaining a high value of 0.92 on the test set. This confirms that the model has successfully learned the key variability characteristics of the gas profiles without significant overfitting, a prerequisite for enhancing retrieval accuracy [28]. A detailed evaluation of the model's performance on the test set is provided in Appendix C. The integration of real-time meteorological parameters via the OpenWeatherMap API (api.openweathermap.org) [29] enables the generation of corresponding atmospheric profiles at specific times.
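As an illustration, the sketch below queries OpenWeatherMap's current-weather endpoint with the requests library; the selected response fields and the endpoint version are assumptions and may differ from the inputs actually fed to the model:

```python
import requests

def fetch_weather(lat: float, lon: float, api_key: str) -> dict:
    """Fetch real-time surface meteorological parameters used as model inputs."""
    url = "https://api.openweathermap.org/data/2.5/weather"
    resp = requests.get(
        url,
        params={"lat": lat, "lon": lon, "appid": api_key, "units": "metric"},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    # Keep only the quantities used as predictors (illustrative selection).
    return {
        "temperature_C": data["main"]["temp"],
        "pressure_hPa": data["main"]["pressure"],
        "humidity_pct": data["main"]["humidity"],
    }
```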
Due to the limited size of the training dataset, geographic coordinates such as longitude and latitude were not utilized as features. Consequently, the model’s current applicability is confined to the Hefei region. Future work is expected to incorporate datasets from diverse geographical locations to expand the model’s regional applicability.
3. Data Processing
With the prior profile generation model established, we now focus on preprocessing the raw LHR signals to ensure high-quality data for the subsequent retrieval analysis. The raw signals were acquired through point-by-point scanning in Hefei, China (31.9°N, 117.16°E), on 22 August 2025. The experimental setup and signal acquisition process are illustrated in Figure 5. While the experimental details are beyond the scope of this article, further information can be found in references [30,31,32,33]. Before proceeding with data retrieval, the raw signals (Figure 6) underwent several preprocessing steps, including solar signal correction, normalization, and wavenumber offset calibration.
To mitigate fluctuations in the solar signal [34], the LHR signal was normalized by dividing it by the synchronously acquired solar signal, effectively eliminating variations caused by solar instability. Subsequently, the LOWESS adaptive baseline fitting method was applied to further refine the normalized signal, addressing both baseline drift and noise interference (Figure 7).
The LOWESS method estimates a slowly varying baseline by performing weighted least squares regression within the local neighborhood of each wavenumber point. This involves minimizing the following objective function:

$$\min_{\beta(\nu_0)} \sum_{i=1}^{N} K_h(\nu_i - \nu_0)\,\bigl[y_i - \beta_0(\nu_0) - \beta_1(\nu_0)(\nu_i - \nu_0)\bigr]^2$$

Here, $y_i$ is the normalized signal at wavenumber $\nu_i$, $\beta(\nu_0) = \bigl(\beta_0(\nu_0), \beta_1(\nu_0)\bigr)$ denotes the regression coefficients at wavenumber $\nu_0$, $h$ is the bandwidth controlling the local neighborhood size, and $K_h(\nu_i - \nu_0)$ is the kernel function weighting neighboring points based on their distance from $\nu_0$. The kernel function is defined as:

$$K_h(\nu_i - \nu_0) = \left(1 - \left|\frac{\nu_i - \nu_0}{h}\right|^3\right)^3 \mathbf{1}\!\left\{\,|\nu_i - \nu_0| < h\,\right\}$$

The indicator function $\mathbf{1}\{\cdot\}$ equals 1 when $|\nu_i - \nu_0| < h$ and 0 otherwise, defining the weight decay for neighboring points. Non-absorbing spectral regions are used to construct the baseline dataset, ensuring that only relevant data contribute to the baseline estimation.
The smoothing parameter $h$, which determines the bandwidth for local regression in LOWESS, is directly proportional to the total number of data points $N$, scaled by the fraction specified via the frac parameter. Mathematically, this relationship can be expressed as $h = \mathrm{frac} \times N$, where frac controls the proportion of data points included in each local neighborhood. A smaller frac value results in a narrower bandwidth, capturing finer details but potentially increasing noise sensitivity, while a larger frac smooths over broader regions, mitigating noise but risking the loss of subtle spectral features. After determining the appropriate smoothing parameter, linear interpolation is applied across the entire spectral domain, followed by normalization using the following equation:
$$S_{\mathrm{corr}}(\nu) = \frac{S_{\mathrm{meas}}(\nu)}{\hat{B}(\nu)}$$

Here, $S_{\mathrm{corr}}(\nu)$ is the spectral signal after baseline correction, $S_{\mathrm{meas}}(\nu)$ is the originally measured spectral signal, and $\hat{B}(\nu)$ denotes the estimated baseline derived via the LOWESS procedure. To assess the performance of various baseline fitting methods, 75 data points from before and after the LHR measurement at 9:52:11 were chosen for signal-to-noise ratio (SNR) calculations. As illustrated in Figure 7b,d, the LOWESS method achieves an SNR of 297.17 when frac = 0.05, which is nearly five times the SNR of 61.5 obtained using polynomial fitting.
This substantial improvement in SNR directly contributes to enhanced retrieval accuracy, as reduced noise levels facilitate clearer differentiation of weak absorption features from background fluctuations. Additionally, the frac parameter in LOWESS offers precise control over the degree of smoothing, as illustrated in Figure 7b. This tunable balance between noise suppression and the retention of subtle spectral characteristics underscores the adaptability of LOWESS, particularly in complex spectral regions, ultimately leading to improved sensitivity and reliability in the spectral inversion process.
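The baseline estimation and normalization described above can be sketched with the statsmodels LOWESS implementation as follows; the absorption-mask handling and variable names are assumptions for illustration:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def lowess_baseline_correct(wavenumber, signal, absorption_mask, frac=0.05):
    """Estimate a slowly varying baseline on non-absorbing points and normalize."""
    # Fit the baseline only where absorption_mask is False (non-absorbing regions).
    fit = lowess(signal[~absorption_mask], wavenumber[~absorption_mask],
                 frac=frac, return_sorted=True)
    # Linearly interpolate the baseline over the entire spectral domain.
    baseline = np.interp(wavenumber, fit[:, 0], fit[:, 1])
    # Normalize the measured signal by the estimated baseline.
    return signal / baseline
```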
To address residual micro-drift inherent in point-by-point scanning and wavelength meter calibration, a correlation coefficient-based calibration method was employed. Specifically, the measured spectrum is shifted relative to a forward reference spectrum, and the correlation coefficient is computed at each offset. Through an iterative refinement process involving interpolation, the wavelength offset corresponding to the maximum correlation coefficient is accurately determined, ensuring high-precision wavelength correction. This step is critical because even minor spectral shifts, on the order of a fraction of the instrument’s resolution, can introduce significant systematic errors in the retrieved column abundance by causing a mismatch between the measured spectrum and the forward model. With these preprocessing steps completed, the data were prepared for subsequent retrieval.
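A simple version of this correlation search is sketched below; the coarse offset grid and the parabolic refinement of the maximum are illustrative choices rather than the exact procedure used here:

```python
import numpy as np

def find_wavenumber_offset(wn, measured, reference, max_shift=0.05, step=0.001):
    """Return the spectral shift (in cm^-1) maximizing correlation with a reference."""
    offsets = np.arange(-max_shift, max_shift + step, step)
    corr = [np.corrcoef(np.interp(wn, wn + d, measured), reference)[0, 1]
            for d in offsets]
    i = int(np.argmax(corr))
    # Parabolic refinement around the coarse maximum (skipped at the grid edges).
    if 0 < i < len(offsets) - 1:
        c_m, c_0, c_p = corr[i - 1], corr[i], corr[i + 1]
        frac_step = 0.5 * (c_m - c_p) / (c_m - 2.0 * c_0 + c_p)
        return offsets[i] + frac_step * step
    return offsets[i]
```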
4. Data Retrieval
After completing the data preprocessing, atmospheric parameters were retrieved using the optimal estimation method (OEM), as originally described by Rodgers (1977) [35] and further refined in recent studies (e.g., [36,37]). The OEM framework establishes a mathematical relationship between the measurements and the atmospheric state vector $x$, expressed as:

$$y = F(x) + \epsilon$$

Here, $y$ denotes the measurement vector of the observed radiance spectrum. The forward model $F$ simulates the measured spectrum based on the atmospheric state vector $x$, while $\epsilon$ represents the combined effects of measurement noise and model uncertainties.
To address non-selective absorption and residual baseline effects within the spectral region, a second-order polynomial baseline correction was integrated into the retrieval process, consistent with established methodologies for spectroscopic data analysis [13]. Parameters $a$, $b$, and $c$ represent the coefficients of the quadratic polynomial baseline. The retrieval procedure employs an iterative optimization scheme based on Bayesian inference, utilizing Gaussian probability density functions to minimize the cost function [38,39]. This approach has been widely adopted in atmospheric remote sensing applications.
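Under these Gaussian assumptions, the cost function minimized during the iterations takes the standard optimal estimation form (the specific covariance settings are implementation choices not spelled out here):

$$J(x) = \bigl[y - F(x)\bigr]^{T} S_{\epsilon}^{-1} \bigl[y - F(x)\bigr] + \bigl(x - x_{a}\bigr)^{T} S_{a}^{-1} \bigl(x - x_{a}\bigr)$$

where $x_a$ is the a priori state vector, $S_a$ its covariance matrix, and $S_{\epsilon}$ the measurement-error covariance matrix.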
The atmospheric transmission spectrum is modeled using the radiative transfer forward model F, incorporating spectroscopic databases such as HITRAN or GEISA. This model is implemented within Py4CAtS (Python for Computational Atmospheric Spectroscopy, version 4.0.0), a Python-based radiative transfer package. Py4CAtS re-implements the Fortran-encoded GARLIC framework, which is primarily designed for infrared and microwave radiative transfer calculations. To ensure high accuracy, Py4CAtS employs a line-by-line integration approach. This method determines atmospheric transmittance by computing individual spectral lines, thereby providing precise results.
Py4CAtS operates by solving the radiative transfer equation (given below) for each spectral line individually, enabling high precision in modeling the interaction between radiation and atmospheric constituents. The core principle involves calculating both the absorption and emission contributions of individual molecular transitions along a defined atmospheric path. This process leverages detailed spectroscopic databases such as HITRAN or GEISA to retrieve accurate line parameters, including line positions, intensities, and broadening coefficients. This method has also been successfully applied in atmospheric inversion studies, for example by Wang et al. (2023) [13].
The radiative transfer equation is expressed as:

$$I(\nu, s) = I(\nu, 0)\, e^{-\tau(\nu, s)} + \int_{0}^{s} J(\nu, s')\,\alpha(\nu, s')\, e^{-\left[\tau(\nu, s) - \tau(\nu, s')\right]}\, \mathrm{d}s'$$

where $I(\nu, s)$ is the specific intensity at position $s$ along the path, $I(\nu, 0)$ the initial intensity, $\tau(\nu, s)$ the total optical depth up to $s$, $J(\nu, s')$ the source function, and $\alpha(\nu, s')$ the absorption coefficient at $s'$.
The relationship between the total optical depth ($\tau$) and the atmospheric transmittance ($\mathcal{T}$) is established using the Beer-Lambert law. Transmittance is defined as the fraction of incident radiation that passes through a medium, which can be mathematically expressed as:

$$\mathcal{T}(\nu) = e^{-\tau(\nu)}$$

Here, $\mathcal{T}(\nu)$ denotes the transmittance at a specific wavenumber $\nu$, while $\tau(\nu)$ represents the corresponding total optical depth.
The transformation from vertical to slant-path transmittance is essential for accurately modeling atmospheric absorption along the observation path. This process involves modifying the optical depth based on the solar zenith angle ($\theta$). By applying the Beer-Lambert law, the relationship between the vertical ($\mathcal{T}_{\mathrm{v}}$) and slant-path ($\mathcal{T}_{\mathrm{s}}$) transmittance can be formulated as follows:

$$\mathcal{T}_{\mathrm{s}}(\nu) = e^{-\tau(\nu)/\cos\theta} = \bigl[\mathcal{T}_{\mathrm{v}}(\nu)\bigr]^{1/\cos\theta}$$

The air mass factor (AMF), defined as $\mathrm{AMF} = 1/\cos\theta$, quantifies the elongation of the optical path at non-zero zenith angles. For absorbing media, where $\mathcal{T}_{\mathrm{v}}(\nu) < 1$, it follows that $\mathcal{T}_{\mathrm{s}}(\nu) < \mathcal{T}_{\mathrm{v}}(\nu)$ when $\mathrm{AMF} > 1$. This result aligns with the physical intuition that longer optical paths lead to greater absorption.
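A minimal sketch of this vertical-to-slant conversion, assuming a plane-parallel atmosphere and illustrative variable names, is:

```python
import numpy as np

def slant_transmittance(tau_vertical, sza_deg):
    """Convert a vertical optical-depth spectrum into slant-path transmittance."""
    amf = 1.0 / np.cos(np.radians(sza_deg))  # plane-parallel air mass factor
    return np.exp(-amf * tau_vertical)       # Beer-Lambert law along the slant path
```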
By integrating over all contributing spectral lines, Py4CAtS constructs a comprehensive representation of the atmospheric transmission spectrum, thereby ensuring consistency with physical principles and observed data.
Furthermore, the incorporation of instrumental effects, particularly the instrument line shape (ILS), significantly improves the accuracy of the simulated spectra, aligning them more closely with real-world measurement characteristics. The calculation of the slant-path transmittance spectrum is performed prior to the application of the ILS convolution. Specifically, the zenith angle correction is applied within the radiative transfer modeling, ensuring a more precise depiction of atmospheric absorption along the designated observation path (Figure 8).
The forward calculation program computes the forward spectrum and the Jacobian matrix. The input parameters include the atmospheric profiles (pressure, temperature, and atmospheric species), the ILS (Figure 9), and the solar zenith angle.
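The ILS convolution step can be sketched as follows, assuming the ILS kernel is sampled on the same uniform wavenumber grid as the monochromatic spectrum:

```python
import numpy as np

def apply_ils(spectrum, ils_kernel):
    """Convolve the monochromatic slant-path spectrum with the instrument line shape."""
    kernel = ils_kernel / ils_kernel.sum()        # normalize the ILS to unit area
    return np.convolve(spectrum, kernel, mode="same")
```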
In the inverse program, the forward calculation is iteratively performed. This iterative process minimizes the cost function using the Levenberg-Marquardt (LM) algorithm. The LM algorithm was selected as it provides a robust and efficient compromise between the Gauss–Newton algorithm and the method of gradient descent, making it well-suited for non-linear least-squares problems common in atmospheric retrieval. The state vectors in this study represent the retrieved atmospheric parameters.
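For reference, a textbook form of the Levenberg-Marquardt update within the optimal estimation framework is (the damping and scaling strategy actually implemented may differ):

$$x_{i+1} = x_{i} + \Bigl[(1+\gamma_{i})\,S_{a}^{-1} + K_{i}^{T} S_{\epsilon}^{-1} K_{i}\Bigr]^{-1} \Bigl\{K_{i}^{T} S_{\epsilon}^{-1}\bigl[y - F(x_{i})\bigr] - S_{a}^{-1}\bigl(x_{i} - x_{a}\bigr)\Bigr\}$$

where $K_i = \partial F/\partial x$ is the Jacobian evaluated at $x_i$ and $\gamma_i$ is the damping parameter, decreased when an iteration reduces the cost function and increased otherwise.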
To illustrate the data retrieval process, representative data collected at 9:52:11 on 22 August 2025 were selected. The pressure and temperature profiles used for the retrieval are shown in Figure 10, both derived from the trained model. The atmosphere was discretized into 45 layers from the surface to 70 km, with varying altitude resolutions of 0.5, 1, 2, 3, 4, and 4 km, respectively. The prior profiles of CO2 and H2O were also obtained from the model output. The real-time weather parameters used as input to the model are detailed in Appendix B. Additionally, the real-time solar zenith angles were determined using Pysolar [40], consistent with the NOAA solar calculator [41].
The LM iteration is terminated when the convergence criterion falls below a preset threshold or after a maximum of 20 iterations. For an effective retrieval, the final value of the cost function ($\chi^2$) should approximate the size of the measurement vector ($m$). The inversion results are shown in Figure 11. In addition to the scaling factor, the three parameters of the quadratic polynomial baseline (Figure 12) were iteratively adjusted to eliminate non-selective power contributions. The fitted spectrum, baseline, and residuals align with the point intervals of the preprocessed LHR data.
The iterative process is depicted in the inset of Figure 11a, where $\chi^2/m$ approaches unity after 10 iterations. As illustrated in Figure 12b, even when a trial step causes the parameter $b$ to exceed its expected range during early iterations, the algorithm effectively corrects the deviation through subsequent iterations, demonstrating robust convergence behavior. After convergence, the retrieved CO2 profile was calculated by multiplying the retrieved scaling factor with the relative a priori profile. Using the water vapor profile from the model, the dry-air column-averaged mixing ratio (XCO2) was calculated as approximately 421.77 ppm.
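For illustration, the column-averaging step can be sketched as below; it assumes both mixing ratios refer to moist air on common pressure levels and uses simple pressure-difference weighting without per-layer gravity corrections, which may differ from the exact formulation used here:

```python
import numpy as np

def xco2_from_profiles(p_levels_pa, co2_vmr, h2o_vmr):
    """Pressure-weighted column-averaged dry-air mole fraction of CO2 (XCO2)."""
    dp = np.abs(np.diff(p_levels_pa))            # layer pressure thickness
    co2 = 0.5 * (co2_vmr[:-1] + co2_vmr[1:])     # layer-mean CO2 mole fraction
    h2o = 0.5 * (h2o_vmr[:-1] + h2o_vmr[1:])     # layer-mean H2O mole fraction
    return np.sum(co2 * dp) / np.sum((1.0 - h2o) * dp)  # CO2 column / dry-air column
```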
As illustrated in Figure 13, consecutive measurements conducted on 22 August 2025 demonstrate the high precision of the XCO2 retrieval process. The retrieved XCO2 values exhibit a statistical mean of 422.04 ppm with a standard deviation of approximately 0.68 ppm.
5. Discussion
This study presents a novel framework for retrieving atmospheric parameters by integrating laser heterodyne radiometry (LHR) with random forest-based prior profile generation. The LHR experiment was conducted in Hefei, China, on 22 August 2025. Real-time weather parameters were employed to generate prior profiles using the developed prediction model, while LOWESS was applied to preprocess the LHR signals, leading to a robust set of XCO2 inversion results. The precision of our retrievals was evaluated by calculating the relative uncertainty of the XCO2 time series, defined as the standard deviation divided by the mean:

$$\delta = \frac{\sigma}{\mu} \times 100\%$$

where $\sigma$ is the standard deviation and $\mu$ is the statistical mean of the retrieved XCO2 values over the measurement period. Based on a standard deviation of 0.68 ppm and a mean of 422.04 ppm, this calculation yields a relative uncertainty of just 0.16%. This high level of precision is particularly notable, as it is even lower than the 0.17–0.5% relative uncertainty range previously reported for similar state-of-the-art LHR systems by Wang et al. (2020) [42]. It also represents a notable improvement in retrieval precision over conventional methods that rely on static climatological profiles, which can introduce additional variance and bias.

We attribute this enhanced precision to two key innovations introduced in this work. First, the use of dynamic, real-time a priori profiles generated by our random forest model provides a more accurate initial state for the inversion, reducing the biases that can arise from static, climatological profiles. Second, our application of the LOWESS method for baseline correction yielded a nearly five-fold improvement in the signal-to-noise ratio of the preprocessed spectra. A cleaner input spectrum with less residual noise directly translates into a more stable and precise inversion, as predicted by optimal estimation theory. The stability of our results is further demonstrated by the fact that the retrieved XCO2 values remained tightly clustered around the mean, with only minor fluctuations, despite variations in the solar zenith angle (and thus the air mass) during the two-hour measurement period. This robustness confirms that our integrated methodology, combining a machine learning-based prior with advanced signal processing, effectively mitigates the primary sources of random error, namely system noise and retrieval instabilities.
In terms of computational efficiency, the random forest-based approach demonstrates significant advantages over traditional climatological models. By leveraging real-time meteorological inputs, the model reduces reliance on static datasets and ensures adaptability to dynamic atmospheric conditions. This contrasts with conventional methods that often require extensive precomputed libraries, leading to higher computational overhead. Moreover, the use of Py4CAtS for radiative transfer modeling ensures compatibility with high-resolution spectroscopic databases like HITRAN and GEISA, further enhancing retrieval precision.
The applications for this high-precision monitoring framework are not limited to atmospheric science but extend into several other disciplines. For example, in agriculture and ecosystem science, continuous, real-time monitoring of column CO2 offers a powerful way to validate carbon flux models. Deploying LHR instruments over a cornfield or a forest would allow researchers to directly track the daily and seasonal drawdown of atmospheric CO2, providing a top-down measure of net ecosystem exchange (NEE). This integrated signal, which captures the balance of plant photosynthesis and soil respiration, is invaluable for developing climate-smart farming techniques, improving crop yield forecasts, and verifying the success of carbon sequestration projects. This methodology could be equally transformative for urban planning and public health. A network of LHR instruments spread across a city could map the urban CO2 dome with high precision, allowing city planners to pinpoint and quantify major emission hotspots like traffic corridors and industrial parks. Such a network would offer near real-time feedback on the effectiveness of new emission reduction policies. And because CO2 is often co-emitted with other harmful air pollutants, these measurements could also serve as a vital proxy for tracking air quality and informing public health advisories.
Despite the advancements achieved, certain limitations remain. The model is currently region-specific, optimized for the Hefei area, and may require adaptation for use in other geographic locations with differing climatic conditions. Extreme weather or anomalous atmospheric phenomena could also impact the performance of the random forest-based prior profile generation. Future work will focus on developing adaptive models to handle diverse environmental conditions while extending the methodology to include additional greenhouse gases such as CH4 and N2O, thereby enhancing its applicability. Further improvements aim to refine measurement accuracy by enabling all points of the prior profile to participate in the inversion process under physical constraints, moving beyond simple scaling factor adjustments.
The implementation of the prior profile generation model, data preprocessing, and baseline-corrected inversion algorithm in Python establishes a robust foundation for operational deployment. These components can be integrated into an end-to-end real-time system for continuous atmospheric monitoring. The methodology is compatible with automated processing, making it suitable for long-term observational studies and climate change research. Future work will incorporate advanced algorithms to enable periodic inversions, offering deeper insights into seasonal and annual variations of atmospheric CO2 and supporting global atmospheric monitoring efforts.