Article

A Novel Framework Integrating Spectrum Analysis and AI for Near-Ground-Surface PM2.5 Concentration Estimation

1
Institutes of Physical Science and Information Technology, Anhui University, Hefei 230039, China
2
State Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology, Anhui University, Hefei 230039, China
3
Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, Hefei 230601, China
4
Key Lab of Environmental Optics & Technology, Anhui Institute of Optics and Fine Mechanics, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
5
Department of Precision Machinery and Precision Instrumentation, University of Science and Technology of China, Hefei 230026, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(22), 3780; https://doi.org/10.3390/rs17223780
Submission received: 12 September 2025 / Revised: 14 November 2025 / Accepted: 19 November 2025 / Published: 20 November 2025
(This article belongs to the Section Atmospheric Remote Sensing)

Highlights

What are the main findings?
  • Near-ground-surface PM2.5 concentrations are estimated directly by integrating spectral analysis and machine learning.
  • PM2.5 measurement is achieved using a single MAX-DOAS instrument without reliance on meteorological data.
What is the implication of the main finding?
  • Horizontal spatial distribution characteristics of PM2.5 within urban areas are identified.
  • Synchronous observations from two MAX-DOAS instruments reveal intra-urban PM2.5 pollution transport processes.

Abstract

Monitoring the horizontal distribution of PM2.5 within urban areas is of great significance, not only for environmental management but also for providing essential data to understand the distribution, formation, transport, and transformation of PM2.5 within cities. This study proposes a novel approach, the Spectral Analysis-based PM2.5 Estimation Machine Learning (SAPML) model. This method uses a machine learning model trained with features derived from multi-azimuth and multi-elevation MAX-DOAS observations, specifically the oxygen dimer (O4) differential slant column densities (O4 dSCDs), and labels provided by near-surface ground measurements corresponding to each azimuthal direction, to estimate near-surface PM2.5 concentrations. This approach does not rely on meteorological data and enables multi-directional near-surface PM2.5 monitoring using only a single independent instrument. SAPML bypasses the intermediate retrieval of aerosol extinction coefficients and directly estimates PM2.5 concentrations from spectral analysis results, thereby avoiding the accumulation of errors. Using O4 dSCD data from multiple MAX-DOAS stations for model training eliminates inter-station conversion differences, allowing a single model to be applied across multiple sites. Station-based k-fold cross-validation yielded an average Pearson correlation coefficient (R) of 0.782, demonstrating the robustness and transferability of the method across major regions in China. Among the machine learning algorithms evaluated, Extreme Gradient Boosting (XGBoost) exhibited the best performance. Feature optimization based on importance ranking reduced data collection time by approximately 30%, while the correlation coefficient (R) of the estimation results decreased by only about 1.3%. The trained SAPML model was further applied to two MAX-DOAS stations in Hefei, HF-HD and HFC, successfully resolving the near-surface PM2.5 spatial distribution at both sites.
The results revealed clear intra-urban heterogeneity, with higher PM2.5 concentrations observed in the western industrial park area. During the same observation period, an east-to-west PM2.5 pollution transport event was captured: PM2.5 increases were first detected in the upwind direction at HF-HD, followed by the downwind direction at the same station, and finally at the downwind station HFC. These results indicate that the SAPML model is an effective approach for monitoring intra-urban PM2.5 distributions.

1. Introduction

PM2.5, defined as fine particulate matter with an aerodynamic diameter smaller than 2.5 µm, exerts profound impacts on air quality, human health, and climate change. Numerous studies have demonstrated its association with increased morbidity and mortality, as well as its role in atmospheric radiative forcing [1,2,3]. Long-term exposure to PM2.5 is associated with diseases and premature mortality caused by conditions such as heart disease, lung cancer, chronic obstructive pulmonary disease (COPD), stroke, type 2 diabetes, lower respiratory infections (LRIs) such as pneumonia, and adverse birth outcomes including preterm birth and low birth weight [4,5,6]. In 2021, approximately 630,000 deaths in Southeast Asia were attributable to air pollution. South Asia is home to some of the world’s most populous countries and is among the regions bearing the highest disease burden from air pollution, with nearly 2.7 million deaths in 2021. In China, air pollution was also responsible for an estimated 1.86 million deaths in the same year [7]. In India, approximately 35% of all deaths are linked to air pollution [8]. Within urban areas, complex terrain, diverse emission sources, and uneven population density lead to significant spatial variability in near-surface PM2.5 concentrations. When assessments rely on single-direction and low-temporal-resolution monitoring, such variability may introduce exposure bias. This underscores the importance of obtaining multi-directional and high-temporal-resolution observations of intra-urban PM2.5 distributions. Such monitoring not only reduces exposure bias but also provides critical scientific insights into the horizontal distribution characteristics of PM2.5, which are essential for developing effective and targeted air pollution control strategies.
Currently, in situ measurements remain the primary approach for monitoring PM2.5 concentrations, owing to their high accuracy and reliability, and they are often regarded as the reference standard for validating other measurement methods [9,10]. Common automated instruments include Beta Attenuation Monitors (BAMs), Tapered Element Oscillating Microbalances (TEOMs), and optical scattering sensors [11,12,13,14]. The China National Environmental Monitoring Center (CNEMC) has established a nationwide air quality monitoring network utilizing BAM and TEOM devices, providing hourly PM2.5 observations [15,16]. Despite their high precision, the limited number of monitoring stations restricts spatial coverage, particularly in suburban and mountainous areas. In addition to conventional mass concentration measurements, some in situ instruments also target the optical properties and size distributions of aerosols. For instance, particle size spectrometers (such as GRIMM) can provide size-resolved concentrations of PM1.0, PM2.5, and PM10 [17], while aethalometers measure aerosol absorption coefficients to estimate black carbon mass concentrations [18,19]. These optical parameters are particularly valuable for climate research and source apportionment studies.
To overcome limitations in spatial coverage, monitoring instruments have also been deployed on mobile platforms [20]. For example, Yu et al. operated more than 300 mobile monitoring vehicles equipped with laser scattering sensors and GPS modules in Jinan during December 2019 and July 2020. This campaign achieved 30 m spatial resolution and minute-level temporal resolution, revealing localized pollution hotspots in traffic-dense areas [21]. Moreover, the advent of portable monitoring devices has enabled public participation in crowdsourced air quality monitoring [22]. However, as a type of low-cost sensor (LCS), laser scattering sensors are sensitive to humidity and prone to drift, resulting in relatively lower accuracy compared with conventional in situ measurements [23].
Conventional ground-based monitoring networks provide reliable measurements of PM2.5 concentrations but offer limited spatial representativeness. Remote sensing techniques, while unable to directly measure particulate matter, provide aerosol optical properties and thus possess distinct advantages in expanding spatial coverage [24]. Among satellite remote sensing instruments, MODIS (Moderate Resolution Imaging Spectroradiometer) provides global aerosol optical depth (AOD) products with typical spatial resolutions of 1–10 km and a revisit period of about 1–2 days [25,26]. CALIPSO (Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observation) uses lidar to detect aerosol vertical profiles with a horizontal spatial resolution of up to 333 m but has a longer revisit period of approximately 16 days [27]. Himawari-8, a geostationary meteorological satellite, enables rapid observations over East Asia every 10 min, delivering high-temporal-resolution cloud and aerosol information [28]. However, satellite remote sensing only retrieves column-integrated optical information and cannot directly reflect near-ground-surface PM2.5 concentrations. Ground-based remote sensing systems include Multi-Axis Differential Optical Absorption Spectroscopy (MAX-DOAS), sun photometers (e.g., Aerosol Robotic Network (AERONET) and the Campaign on Atmospheric Aerosol Research Network of China (CARSNET)), and lidar. Networks such as AERONET and CARSNET, based on direct solar irradiance measurements, provide key aerosol parameters such as AOD and single-scattering albedo (SSA) [29]. Lidar systems can observe the vertical structure of near-ground-surface aerosols [30]. MAX-DOAS collects ultraviolet/visible spectra at multiple elevation angles and, combined with gas absorption cross-sections, retrieves boundary-layer aerosol column concentrations, extinction coefficients, and boundary-layer height [31]. Wagner et al. first proposed the use of MAX-DOAS observations of the oxygen dimer (O4) to retrieve atmospheric aerosol profiles. They found that MAX-DOAS O4 observations are highly sensitive, capable of detecting aerosol extinction coefficients below 0.001 [32]. Irie et al. conducted ground-based MAX-DOAS observations in Tsukuba, Japan, and retrieved vertical profiles of aerosol optical depth and extinction coefficients at 476 nm in the lower troposphere using differential optical absorption spectroscopy and optimal estimation methods [33]. Nevertheless, MAX-DOAS aerosol retrievals are strongly nonlinear and provide only extinction coefficients rather than direct PM2.5 concentrations. Moreover, the conversion from extinction coefficients to PM2.5 is weakly linear and highly region-dependent, with significant variations in slope and intercept across different locations [34,35]. These remote sensing techniques have greatly expanded the spatial monitoring capability for PM2.5, but they can only measure the optical parameters of aerosols, which must be converted into PM2.5 concentrations through other approaches.
Traditional approaches typically convert aerosol optical parameters such as aerosol optical depth (AOD) into PM2.5 concentrations using simple linear regression models. In these models, statistical relationships are established between ground-based PM2.5 concentrations and collocated AOD observations. Based on data from the CARSNET, Xin et al. conducted generalized linear regression analyses of PM2.5 and AOD over regions with high aerosol loadings. Their results demonstrated significant spatial consistency and correlation between PM2.5 and AOD. However, the linear regression functions exhibited considerable variations across regions and seasons, thereby limiting the generalizability of this approach [36]. Beyond linear regression, another class of methods incorporates global chemical transport models (e.g., GEOS-Chem). Van Donkelaar et al. proposed a multi-source fusion inversion method integrating satellite AOD, meteorological variables, and global chemical model simulations to generate global annual mean PM2.5 concentration datasets for 1998–2016. This dataset has been widely adopted internationally [37]. Nevertheless, such methods remain constrained by model parameter settings and the completeness of input data, making it difficult to capture the rapidly evolving spatiotemporal patterns of air pollution [38].
In recent years, artificial intelligence (AI) technologies have been widely applied in the field of environmental remote sensing, particularly in studies integrating AI with MAX-DOAS ground-based observations. For instance, Wang et al. employed MAX-DOAS combined with machine learning techniques to retrieve ozone concentration profiles, achieving temporal resolutions on the order of minutes and vertical resolutions on the order of hundreds of meters [39]. Tian et al., on the other hand, utilized MAX-DOAS in conjunction with convolutional neural networks (CNNs) to predict tropospheric NO2 concentration profiles [40]. However, particulate matter differs significantly from gases in terms of physical properties: the interaction between light and gases is primarily absorption, whereas light–particle interactions involve both absorption and scattering. This introduces strong nonlinear effects when attempting to retrieve PM2.5 concentrations from MAX-DOAS spectral observations.
With the rapid development of AI, especially machine learning and deep neural networks, these techniques have demonstrated excellent capabilities for modeling the nonlinear relationships in PM2.5 concentration estimation [41,42,43]. Li et al. proposed a geographically and temporally weighted neural network constrained by global training (GC-GTWNN) for satellite-based surface PM2.5 estimation. This model was tested in a case study over China, integrating satellite aerosol optical depth, surface PM2.5 observations, and auxiliary variables. Cross-validation yielded an R2 of 0.80 [44]. Karimian et al. employed multiple additive regression trees (MARTs), deep feedforward neural networks (DFNNs), and a novel hybrid model based on long short-term memory (LSTM). Their model explained 80% of PM2.5 variability (R2 = 0.8) and predicted 75% of pollution events [45]. AI-based methods typically rely on multi-source data inputs, including AOD, meteorological fields, geographic features, and ground observations. When data completeness is sufficient, these methods can provide high-accuracy PM2.5 concentration estimates with continuous spatiotemporal coverage. Temporal resolution can reach hourly levels, and spatial resolution can be as fine as 1 km [46]. However, the reliance on meteorological and other auxiliary data limits the ability of traditional artificial intelligence (AI) methods to provide estimates of PM2.5 concentrations. Moreover, to date, none of the AI-based approaches has directly converted O4 observations from MAX-DOAS measurements into PM2.5 concentrations; instead, most methods first derive aerosol extinction coefficients and subsequently convert them into PM2.5 concentrations.
This study proposes a Spectral Analysis-based PM2.5 Estimation Machine Learning (SAPML) model, which integrates MAX-DOAS observations, spectral analysis, and machine learning techniques. The method enables direct retrieval of near-surface PM2.5 concentrations from O4 dSCD data, bypassing the intermediate step of aerosol extinction coefficient inversion and thus avoiding cumulative errors. The SAPML model shows a degree of spatial extrapolation capability and is applicable to multiple regions in China. The study also optimizes the number of viewing angles and shortens the observation cycle specifically for near-ground-surface PM2.5 monitoring. To enable independent monitoring with a single instrument, commonly used meteorological data were excluded from the machine learning model, allowing rapid PM2.5 monitoring based solely on observations from a single MAX-DOAS instrument. The SAPML model integrates MAX-DOAS observations from different viewing directions to obtain the horizontal spatial distribution of PM2.5 around the monitoring sites. When applied to multiple sites in Hefei, SAPML provided the horizontal PM2.5 distribution across the city and successfully captured a pollution transport event.

2. Materials and Methods

2.1. O4 dSCD Data

The concentration of the oxygen dimer (O4) can be readily determined because it is proportional to the square of the oxygen concentration. The vertical column density and profile of atmospheric oxygen are mainly influenced by temperature and pressure. The O4 dSCD is the path integral of O4 concentration along the light path. Besides observation time and viewing direction, aerosol properties are the main factors affecting O4 dSCD values at different elevation angles. Aerosol-induced light scattering alters photon paths and therefore changes the O4 dSCD, so O4 measurements carry embedded aerosol information. O4 has four main absorption bands in the ultraviolet–visible spectrum, at approximately 360.8, 477.1, 577.1, and 630.8 nm [47]. To address the sensitivity issue associated with fine particles, this study conducted spectral analysis using the wavelength band centered at 360.8 nm.
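The relations in this paragraph can be written compactly. The block below is a sketch under standard DOAS conventions rather than equations given in this article: $c_{O_2}(s)$ is the oxygen number density along the light path $s$, and the zenith (90°) spectrum is assumed to serve as the reference for the differential quantity.

```latex
c_{O_4}(s) \propto c_{O_2}^{2}(s), \qquad
\mathrm{SCD}_{O_4}(\alpha) = \int_{\mathrm{path}(\alpha)} c_{O_4}(s)\,\mathrm{d}s, \qquad
\mathrm{dSCD}_{O_4}(\alpha) = \mathrm{SCD}_{O_4}(\alpha) - \mathrm{SCD}_{O_4}(90^{\circ})
```

Under this assumed convention the differential value at 90° carries no independent information, which would be consistent with the 90° angle being left out of the model inputs later in the paper.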
This study utilized five MAX-DOAS instruments located at Beijing Academy of Atmospheric Sciences (BJ-CAM), Lanzhou University (LZU), Anhui University (AHU), Hefei University (HFC), and Hefei Hengda Central Plaza (HF-HD). Among these, data from the BJ-CAM, LZU, and AHU sites were used for model training. To prevent the model from becoming overly dependent on the observation azimuth angle during the estimation process, the viewing directions of these three instruments were aligned as closely as possible toward their corresponding national monitoring stations. Due to installation constraints of the MAX-DOAS instruments, the AHU site exhibited an azimuth deviation of approximately 64° from its corresponding station, though its viewing direction generally faced east. Detailed information on the geographic coordinates, observation periods, sequences of elevation angles, and observation azimuth directions (spectral acquisition directions) of these five MAX-DOAS sites is provided in Table 1. In the observation azimuth directions, 0° corresponds to due north, with angles increasing clockwise such that 90° corresponds to due east. Figure 1a,b show the geographic locations of the five MAX-DOAS instruments.
The MAX-DOAS instruments used in this study consist of three main components: a spectrometer, a telescope unit, and a control computer. The spectrometer is an AvaSpec-ULS2048L-USB (Avantes, The Netherlands), covering a wavelength range of 296–408 nm in the ultraviolet spectrum. At a fixed temperature of 20 °C, the spectral resolution is 0.45 nm. The telescope collects solar radiation scattered by various particles and gases in the atmosphere, which is then directed to the spectrometer through a prism mirror and quartz fiber. The elevation angle (α) accuracy is better than 0.1°, with a telescope field of view of 0.3°. Additionally, the system is equipped with a charge-coupled device (CCD) detector camera (Sony ILX511) containing 2048 individual pixels. In this study, a complete measurement sequence consists of 11 elevation angles, 1°, 2°, 3°, 4°, 5°, 6°, 8°, 10°, 15°, 30°, and 90°, with each sequence taking approximately 15 min to complete. The measured MAX-DOAS spectra were analyzed using the QDOAS software (version 3.6.5) developed by the Belgian Institute for Space Aeronomy (BIRA-IASB) (http://uv-vis.aeronomie.be/software/QDOAS/, last accessed on 5 September 2025), which retrieves O4 differential slant column densities (dSCDs) at different elevation angles [48]. The detailed parameter configuration of the QDOAS software (version 3.6.5) used in this study is provided in Appendix A.

2.2. PM2.5 in Situ Measurement

The China National Environmental Monitoring Center (CNEMC) network is a national-level automatic air quality monitoring network that is uniformly planned, constructed, and operated by the Ministry of Ecology and Environment of China. It is primarily used for real-time monitoring of urban air quality, data release, policy formulation, and environmental protection decision-making. CNEMC stations conduct real-time monitoring and reporting of the six criteria air pollutants: PM2.5, PM10, SO2, NO2, CO, and O3. Among these, PM2.5 is mainly measured using BAM and TEOM instruments.
Figure 1 shows the in situ PM2.5 observation data used in this study, consisting of hourly ground-level PM2.5 concentrations from three stations of the China National Environmental Monitoring Center (CNEMC) during 2018–2023: Beijing Guanyuan station (Index: 1006A) (116.339°E, 39.9295°N), Lanzhou Railway Design Institute station (Index: 1479A) (103.831°E, 36.0464°N), and Hefei Mingzhu Square station (Index: 1270A) (117.196°E, 31.7848°N). The Beijing Guanyuan station (1006A) is located approximately 2.56 km from the BJ-CAM instrument, the Lanzhou Railway Design Institute station (1479A) is about 1.78 km from the LZU instrument, and the Hefei Mingzhu Square station (1270A) is about 2.12 km from the AHU instrument.

2.3. Auxiliary Data

MAX-DOAS observations are highly sensitive to cloud interference, and thus it is necessary to screen the data to exclude measurements affected by clouds. In this study, two types of auxiliary data were used for cloud screening. AERONET is a global ground-based aerosol remote sensing network jointly operated by NASA and multiple international partners. It provides long-term, stable, and well-calibrated observations of atmospheric aerosols, including aerosol optical depth (AOD), size distribution, single-scattering albedo (SSA), aerosol types, and absorption properties. AERONET observations are conducted every 15 min and are only valid under cloud-free conditions. Level 1.5 AOD data are unavailable under cloudy conditions. Therefore, MAX-DOAS O4 dSCD data were temporally matched with AERONET Level 1.5 AOD data at a 15 min precision. Data that failed to match were considered affected by cloud shading and excluded. While Beijing had an operational AERONET site from 2018 to 2023, Lanzhou and Hefei lacked such sites during this period, necessitating alternative data sources for cloud screening in these regions. ERA5, developed by the European Centre for Medium-Range Weather Forecasts (ECMWF), is a global high-resolution reanalysis meteorological dataset widely used in meteorology, climate, environmental, and atmospheric sciences. ERA5 total cloud cover (TCC) represents the fraction of the sky covered by clouds, ranging from 0 (clear sky) to 1 (fully overcast), and is a key parameter influencing surface radiation flux, temperature, and precipitation [49]. ERA5 provides high spatial resolution (0.25° × 0.25°) and long-term temporal coverage (from 1979 to present), offering stable TCC data for cities such as Beijing, Lanzhou, and Hefei, making it a feasible option for cloud screening. However, compared with direct observations, ERA5 TCC is probabilistic and less precise. 
Therefore, screening results from ERA5 were compared with those from AERONET to calibrate ERA5 performance and approximate the accuracy of AERONET-based cloud screening.
We used AERONET Level 1.5 AOD data from the Beijing site from 15 April 2018 to 26 May 2023 and ERA5 TCC data for Beijing, Lanzhou, and Hefei over the 2018–2023 period.
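The AERONET-based screening described above can be sketched as a time-bin match. This is a minimal illustration, not the authors' code: it assumes that "matching at a 15 min precision" means a dSCD record and an AERONET record fall in the same 15 min bin, and the function names, record layout, and sample values are hypothetical.

```python
from datetime import datetime, timedelta

def floor_to_bin(ts, minutes=15):
    """Round a timestamp down to the start of its `minutes`-wide bin."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)

def aeronet_cloud_screen(dscd_records, aeronet_times, minutes=15):
    """Keep dSCD records whose time bin also contains an AERONET AOD record.

    Assumption: an AERONET Level 1.5 record in the same bin implies clear sky.
    """
    clear_bins = {floor_to_bin(t, minutes) for t in aeronet_times}
    return [rec for rec in dscd_records
            if floor_to_bin(rec["time"], minutes) in clear_bins]

# Illustrative values only (not measured data).
dscd_records = [
    {"time": datetime(2018, 4, 15, 2, 3), "o4_dscd": 1.9e43},
    {"time": datetime(2018, 4, 15, 2, 18), "o4_dscd": 2.1e43},
    {"time": datetime(2018, 4, 15, 2, 33), "o4_dscd": 1.2e43},
]
aeronet_times = [datetime(2018, 4, 15, 2, 10), datetime(2018, 4, 15, 2, 40)]

kept = aeronet_cloud_screen(dscd_records, aeronet_times)
kept_minutes = [rec["time"].minute for rec in kept]
```

Here the 02:18 record is dropped because no AERONET record falls in the 02:15 to 02:30 bin. A nearest-neighbour match with a tolerance would be an equally plausible reading of the text.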

2.4. Data Processing

In this study, retrievals with RMS greater than 2.0 × 10⁻³ were excluded during spectral fitting to improve the quality of the O4 dSCD data [50]. However, this criterion alone was insufficient, and additional cloud screening of the O4 dSCD data was required.
As solar radiation received by MAX-DOAS instruments is easily blocked by clouds, the quality of the measured spectra can degrade. Therefore, cloud screening of O4 dSCD data is necessary. ERA5 total cloud cover (TCC) data are used to perform cloud screening on O4 dSCD measurements. For each O4 dSCD observation, the corresponding TCC value is compared against a preset screening threshold, and data with TCC values above the threshold are excluded. TCC represents the cloud fraction, defined as the proportion of the sky covered by clouds. This fraction reflects the probability that the sun is obscured by clouds at the observation site: a higher cloud fraction means a higher probability of solar obscuration. When the cloud fraction equals 1, the sun is completely blocked; when it is 0, the sun is unobstructed by clouds. Theoretically, removing all O4 dSCD measurements with nonzero cloud fraction would be the most accurate approach. However, at low cloud coverage, the sun may still be unobstructed despite a nonzero cloud fraction, so this would remove many valid observations. Therefore, an appropriate cloud fraction threshold must be selected to exclude most cloud-obscured cases while minimizing the loss of valid observations. This method cannot completely eliminate observations affected by cloud cover, which may degrade accuracy to some degree. However, no current approach can entirely remove cloud-contaminated data, so the screening can only be optimized as far as possible.
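One way to attach a TCC value to each observation is sketched below. The text does not say how ERA5 cells and times are matched to a site, so the nearest-grid-node lookup, the flooring to the hour, the `tcc_table` layout, and the sample values are all assumptions made for illustration. The threshold is passed in as a parameter; 0.145 is the value the article derives from the AERONET comparison.

```python
from datetime import datetime

GRID_STEP = 0.25  # ERA5 grid spacing in degrees, as stated in the text

def nearest_grid_cell(lat, lon, step=GRID_STEP):
    """Snap a coordinate to the nearest grid node (nearest-neighbour is an assumption)."""
    return round(lat / step) * step, round(lon / step) * step

def passes_tcc_screen(obs_time, lat, lon, tcc_table, threshold):
    """Return True when the matched TCC does not exceed the threshold."""
    hour = obs_time.replace(minute=0, second=0, microsecond=0)  # assumed hourly fields
    cell = nearest_grid_cell(lat, lon)
    return tcc_table[(hour, *cell)] <= threshold

# Hypothetical TCC values for one grid cell near Hefei (not real ERA5 data).
tcc_table = {
    (datetime(2020, 7, 1, 5), 31.75, 117.25): 0.10,
    (datetime(2020, 7, 1, 6), 31.75, 117.25): 0.60,
}
site = (31.7848, 117.196)  # Hefei Mingzhu Square station coordinates from the text

clear = passes_tcc_screen(datetime(2020, 7, 1, 5, 40), *site, tcc_table, threshold=0.145)
cloudy = passes_tcc_screen(datetime(2020, 7, 1, 6, 10), *site, tcc_table, threshold=0.145)
```

Observations for which `passes_tcc_screen` returns False would be dropped before training.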
The TCC screening threshold is adjusted using AERONET observations of solar obscuration by clouds. O4 dSCD data from Beijing are screened separately by ERA5 TCC and AERONET data. The screening rates based on TCC ($SE_{Tcc}$) and AERONET ($SE_{AER}$) are calculated using Equations (1)–(3):
$$dSCD_{AT}^{O_4} = dSCD_{Tcc}^{O_4} \cap dSCD_{AER}^{O_4}, \tag{1}$$
$$SE_{Tcc} = \frac{\left| dSCD_{AT}^{O_4} \right|}{\left| dSCD_{Tcc}^{O_4} \right|}, \tag{2}$$
$$SE_{AER} = \frac{\left| dSCD_{AT}^{O_4} \right|}{\left| dSCD_{AER}^{O_4} \right|}, \tag{3}$$
where $dSCD_{Tcc}^{O_4}$ and $dSCD_{AER}^{O_4}$ denote the O4 dSCD datasets filtered based on TCC and AERONET data, respectively, and $dSCD_{AT}^{O_4}$ represents the intersection of the two datasets. $|dSCD_{AT}^{O_4}|$, $|dSCD_{AER}^{O_4}|$, and $|dSCD_{Tcc}^{O_4}|$ are the numbers of data points in the $dSCD_{AT}^{O_4}$, $dSCD_{AER}^{O_4}$, and $dSCD_{Tcc}^{O_4}$ datasets, respectively.
Next, the most suitable TCC screening threshold needed to be determined. The sums of the screening rates at different TCC thresholds were calculated and fitted with curves. As shown in Figure 2, the intersection of these two curves corresponds to a TCC threshold of 0.145. At this threshold, the amount of data retained after TCC screening is comparable to that retained after AERONET screening, and the probability of cloud presence is relatively low. Under this threshold, the cloud-filtered O4 dSCD dataset obtained via TCC screening exhibits a high degree of agreement with the dataset filtered by AERONET.
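The threshold selection can be illustrated with a small sketch of Equations (1)–(3). It is a simplification under stated assumptions: the article fits curves to the screening rates and reads off their intersection, whereas this example scans a grid of candidate thresholds and picks the one where the two rates are closest. The function names and the toy data are hypothetical.

```python
def screening_rates(tcc_by_time, aeronet_times, threshold):
    """Return (SE_Tcc, SE_AER) for one TCC threshold."""
    kept_tcc = {t for t, tcc in tcc_by_time.items() if tcc <= threshold}
    kept_aer = set(tcc_by_time) & set(aeronet_times)
    both = kept_tcc & kept_aer                               # Equation (1)
    se_tcc = len(both) / len(kept_tcc) if kept_tcc else 0.0  # Equation (2)
    se_aer = len(both) / len(kept_aer) if kept_aer else 0.0  # Equation (3)
    return se_tcc, se_aer

def closest_crossing(tcc_by_time, aeronet_times, thresholds):
    """Pick the candidate threshold where the two screening rates are closest."""
    def gap(threshold):
        se_tcc, se_aer = screening_rates(tcc_by_time, aeronet_times, threshold)
        return abs(se_tcc - se_aer)
    return min(thresholds, key=gap)

# Toy data: observation index -> TCC, plus the indices that have an AERONET match.
tcc_by_time = {0: 0.00, 1: 0.05, 2: 0.10, 3: 0.15, 4: 0.20,
               5: 0.30, 6: 0.50, 7: 0.80, 8: 0.02, 9: 0.12}
aeronet_times = {0, 1, 2, 4, 8, 9}
candidates = [0.05, 0.10, 0.15, 0.20, 0.30]

best = closest_crossing(tcc_by_time, aeronet_times, candidates)
rates_at_best = screening_rates(tcc_by_time, aeronet_times, best)
```

As the threshold rises, $SE_{Tcc}$ tends to fall (more cloudy cases slip through) while $SE_{AER}$ tends to rise (fewer clear cases are rejected), which is why the two curves cross.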

2.5. Model Development

As Figure 3 shows, the QDOAS software (version 3.6.5) analyzes the spectra collected by the MAX-DOAS instrument to retrieve O4 dSCD values at the instrument’s sequence of elevation angles. Some O4 dSCD data are affected by cloud obscuration, while AERONET AOD data are only available under cloud-free conditions. By matching AERONET AOD data with O4 dSCD data based on time, cloud screening of O4 dSCD can be performed. Since not all MAX-DOAS sites have nearby operational AERONET stations, this study uses ERA5 TCC data for cloud screening of O4 dSCD instead. An appropriate TCC screening threshold is selected by calculating screening rates and fitting curves to find their intersection point. After applying this TCC threshold, the amount of O4 dSCD data retained closely matches that filtered by AERONET AOD, showing a high degree of overlap. The cloud-screened O4 dSCD values at all elevation angles except 90° are used as input features. The feature vector for training includes the cloud-filtered O4 dSCD sequences at different elevation angles, seasonal variables, and hourly variables representing the time of day. The feature vector is shown in Equation (4). Ground-level hourly PM2.5 concentration data from the China National Environmental Monitoring Center (CNEMC) are used as labels, with measurement times converted to UTC+0. The selected CNEMC sites are the closest to the respective MAX-DOAS sites.
All machine learning models used in this study employed the Randomized Search CV optimization method to obtain the optimal hyperparameters. The hyperparameter configurations vary among different models, and the detailed settings are provided in Table 2.
In this study, multiple machine learning models are compared, and the best-performing model is selected. To validate the reliability of the SAPML method, spatial three-fold cross-validation and temporal extrapolation tests are conducted. After achieving satisfactory results in both spatial and temporal validation, the trained model is applied to MAX-DOAS instruments to obtain temporal variation characteristics and horizontal distribution features of PM2.5 mass concentrations. This approach also allows differences in PM2.5 mass concentrations to be partially analyzed from different viewing directions at the observation sites.
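A station-based split can be sketched as follows. This assumes that the spatial three-fold cross-validation holds out one of the three training stations per fold; the text does not spell out the fold construction, and the function name and sample layout are hypothetical.

```python
def station_folds(samples):
    """Yield (held_out_station, train_indices, test_indices), one fold per station."""
    stations = sorted({sample["station"] for sample in samples})
    for held_out in stations:
        train = [i for i, s in enumerate(samples) if s["station"] != held_out]
        test = [i for i, s in enumerate(samples) if s["station"] == held_out]
        yield held_out, train, test

# Minimal placeholder samples; real samples would also carry features and labels.
samples = [
    {"station": "BJ-CAM"},
    {"station": "LZU"},
    {"station": "AHU"},
    {"station": "BJ-CAM"},
    {"station": "AHU"},
]

folds = list(station_folds(samples))
```

Because every test fold contains only a station the model has not seen, the resulting score reflects transfer to a new site rather than interpolation within a site.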
$$x_t = \left[ O_4^{(\alpha_1)}(t),\ O_4^{(\alpha_2)}(t),\ \ldots,\ O_4^{(\alpha_n)}(t),\ s(t),\ h(t) \right] \tag{4}$$
where $x_t$ denotes the feature vector at time $t$, $O_4^{(\alpha_i)}(t)$ represents the cloud-screened O4 dSCD measured at elevation angle $\alpha_i$ (excluding 90°), $s(t)$ is a one-hot-encoded seasonal variable (spring = [1, 0, 0, 0], summer = [0, 1, 0, 0], autumn = [0, 0, 1, 0], and winter = [0, 0, 0, 1]), and $h(t)$ is the hour-of-day information (0–23).
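The feature vector of Equation (4) could be assembled as in the sketch below. The month-to-season mapping (meteorological seasons) is an assumption, since the text does not define season boundaries, and the function names and dSCD values are hypothetical.

```python
from datetime import datetime

SEASONS = ("spring", "summer", "autumn", "winter")  # one-hot order used in the text

def season_of(month):
    """Meteorological seasons by month; this mapping is an assumption."""
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    if month in (9, 10, 11):
        return "autumn"
    return "winter"

def build_feature_vector(timestamp, dscd_by_angle):
    """Concatenate dSCDs (all angles except 90 degrees), season one-hot, and hour."""
    angles = sorted(angle for angle in dscd_by_angle if angle != 90)
    dscd = [dscd_by_angle[angle] for angle in angles]
    season = season_of(timestamp.month)
    one_hot = [1 if name == season else 0 for name in SEASONS]
    return dscd + one_hot + [timestamp.hour]

# Illustrative dSCD values in arbitrary units, keyed by elevation angle in degrees.
features = build_feature_vector(
    datetime(2020, 7, 1, 5, 0),
    {1: 3.0, 2: 2.5, 30: 1.0, 90: 0.0},
)
```

With the full 11-angle sequence, the vector would hold 10 dSCD values, 4 season flags, and 1 hour value.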

2.6. Model Evaluation Metrics

This study uses the Pearson correlation coefficient (R) and mean absolute error (MAE) as evaluation metrics for the models. R measures the strength of the linear relationship between two variables and ranges from −1 to 1. An R value of 1 indicates a perfect positive correlation, 0 indicates no correlation, and −1 indicates a perfect negative correlation [51]. MAE represents the average absolute deviation between predicted and true values. A smaller MAE indicates higher prediction accuracy [52]. The formulas for calculating R and MAE are given by
$$R = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2} \times \sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}}, \tag{5}$$
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \tag{6}$$
where $y_i$ denotes the true values, which in this study correspond to the CNEMC PM2.5 concentrations, and $\hat{y}_i$ represents the model-estimated values of PM2.5. $\bar{y}$ is the mean of the true values, and $\bar{\hat{y}}$ is the mean of the model-estimated values. $n$ is the total number of samples.
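Both metrics are simple to compute directly from their definitions. The sketch below uses only the standard library; the function names and the sample values are illustrative and are not data from the study.

```python
from math import sqrt

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observed and estimated values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    cov = sum((a - mean_true) * (b - mean_pred) for a, b in zip(y_true, y_pred))
    var_true = sum((a - mean_true) ** 2 for a in y_true)
    var_pred = sum((b - mean_pred) ** 2 for b in y_pred)
    return cov / (sqrt(var_true) * sqrt(var_pred))

def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between observed and estimated values."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

observed = [10.0, 20.0, 30.0, 40.0]    # hypothetical station values
estimated = [12.0, 18.0, 33.0, 37.0]   # hypothetical model output

r = pearson_r(observed, estimated)
mae = mean_absolute_error(observed, estimated)
```

R is insensitive to a constant offset or scale error in the estimates, while MAE is not, which is why the two are reported together.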

3. Results

3.1. Model Comparison and Temporal Extrapolation of Models

To select the optimal machine learning model for the SAPML method, this study first merged the O4 dSCD data obtained from spectral analysis of MAX-DOAS measurements at BJ-CAM, LZU, and AHU in chronological order. Cloud screening was then applied based on TCC data.
In this study, four different models—Random Forest, XGBoost, Backpropagation (BP) Neural Network, and LightGBM—were selected for comparative analysis [53]. The processed dataset was divided sequentially in chronological order, with the first 80% used for training and the remaining 20% for testing. The hyperparameters of the four models were optimized using a randomized cross-validation search (Randomized Search CV) [54]. Since the temporal order of the data was preserved, the model comparison also serves as a validation of temporal extrapolation capability. The hyperparameter configurations of each model are summarized in Table 2.
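The chronological 80/20 split described above can be sketched as follows. The function name and record layout are hypothetical; the point is that the records are sorted by time and cut once, without shuffling.

```python
from datetime import datetime

def chronological_split(records, train_fraction=0.8):
    """Sort by time, then use the earliest fraction for training and the rest for testing."""
    ordered = sorted(records, key=lambda rec: rec["time"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Ten placeholder records in arbitrary order (values are illustrative).
records = [{"time": datetime(2020, 1, day), "pm25": float(day)}
           for day in (5, 1, 9, 3, 7, 2, 10, 4, 8, 6)]

train, test = chronological_split(records)
latest_train = max(rec["time"] for rec in train)
earliest_test = min(rec["time"] for rec in test)
```

Every test record is later than every training record, so the test score measures extrapolation forward in time rather than interpolation between neighbouring samples.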
The test results are shown in Figure 4. The Random Forest model achieved a correlation coefficient (R) of 0.817, a mean absolute error (MAE) of 10.469, a regression slope of 0.65, and an intercept of 11.58. The XGBoost model achieved an R of 0.825, MAE of 10.183, slope of 0.68, and intercept of 10.57. The LightGBM model achieved an R of 0.824, MAE of 10.241, slope of 0.68, and intercept of 10.56. The BP neural network achieved an R of 0.813, MAE of 10.625, slope of 0.65, and intercept of 11.85.
Machine learning models often tend to smooth extreme values, typically resulting in overestimation of low values and underestimation of high values [55]. Consequently, their fitted regression lines generally have slopes less than 1 and intercepts greater than 0. Ideally, the closer the slope is to 1 and the intercept to 0, the better the model’s estimation performance. The results show that XGBoost and LightGBM exhibit very similar R, MAE, slope, and intercept values, while Random Forest performs slightly worse, with a lower slope and higher intercept. All four models demonstrate reasonable temporal extrapolation capability, with the lowest R being 0.813 and the highest MAE 10.625. Considering its smaller size and faster training speed, XGBoost was selected as the model for the SAPML method.
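The link between smoothing of extremes and a regression slope below 1 can be illustrated with a small least-squares fit. The sketch below assumes the regression line is fitted with observed values on the x-axis and estimated values on the y-axis; the shrinkage factor of 0.3 and the sample values are arbitrary choices for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative observed concentrations (not data from the study).
observed = [10.0, 30.0, 50.0, 70.0, 90.0]
mean_observed = sum(observed) / len(observed)

# Mimic smoothing: pull every estimate 30% of the way toward the mean, which
# raises low values and lowers high values.
estimated = [o + 0.3 * (mean_observed - o) for o in observed]

slope, intercept = fit_line(observed, estimated)
```

With this construction the fitted slope is 0.7 and the intercept is 15, which reproduces the qualitative pattern of slopes below 1 and intercepts above 0.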

3.2. Impact of Cloud Removal

To evaluate the effectiveness of cloud contamination removal using ERA5 TCC data, O4 dSCD data collected and analyzed by MAX-DOAS instruments at BJ-CAM, LZU, and AHU were merged sequentially by time. Two datasets were created for comparison: one containing the original unfiltered data, and the other filtered using a TCC threshold of 0.145 for cloud screening.
Both datasets were randomly split into 80% training and 20% testing sets, with XGBoost selected as the modeling algorithm. Correlation analysis was performed between the PM2.5 concentrations predicted on the test sets and the PM2.5 concentrations observed by CNEMC. The results are shown in Figure 5. For the unfiltered data test set, the correlation analysis yielded an R of 0.706, MAE of 12.200, slope of 0.49, and intercept of 17.17. For the cloud-filtered data test set, the results improved to an R of 0.826, MAE of 10.197, slope of 0.68, and intercept of 10.61. Clouds and rain sometimes occur simultaneously. During rainy periods, PM2.5 concentrations are typically low, leading the model to overestimate them. Consequently, the slope for unfiltered data is lowered, as both the overestimation of low values and the underestimation of high values reduce the slope. The higher R and lower MAE of the cloud-filtered data indicate that TCC-based cloud screening, with an appropriately selected threshold, can enhance the quality of the training dataset.
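A minimal sketch of the TCC-based screening step is shown below. The record layout and the field name `tcc` are hypothetical, and treating a value exactly equal to the threshold as cloud-free is an assumption, since the text does not state whether the comparison is strict.

```python
TCC_THRESHOLD = 0.145  # threshold quoted in the text


def screen_clouds(records, threshold=TCC_THRESHOLD):
    """Keep only records whose total cloud cover does not exceed the threshold."""
    return [record for record in records if record["tcc"] <= threshold]


# Synthetic records: total cloud cover (0 to 1) paired with a placeholder value.
records = [
    {"tcc": 0.00, "o4_dscd": 1.0},
    {"tcc": 0.10, "o4_dscd": 1.1},
    {"tcc": 0.145, "o4_dscd": 1.2},
    {"tcc": 0.20, "o4_dscd": 1.3},
    {"tcc": 0.90, "o4_dscd": 1.4},
]

cloud_free = screen_clouds(records)
```

Applying the same function to both the training and test records keeps the two sets comparable.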

3.3. Spatial Extrapolation Validation

This study conducted a three-site cross-validation (CV) using MAX-DOAS instruments at BJ-CAM, LZU, and AHU. In each validation, data from two instruments were used as the training set, and those from the remaining instrument served as the validation set, resulting in three combinations. To eliminate the influence of temporal differences, only the common period (22–10 October 2020) across the three instruments was used for validation. Based on previous model comparison results, machine learning models tend to perform worse at extremely high or low PM2.5 concentrations, mainly showing overestimation at low values and underestimation at high values. All O4 dSCD data were cloud-screened using a TCC threshold of 0.145.
The validation results for BJ-CAM, LZU, and AHU are shown in Figure 6. Using LZU and AHU as training sets and BJ-CAM as validation, the results were R = 0.865, MAE = 10.203, slope = 0.62, and intercept = 14.63. Using BJ-CAM and AHU as the training sets and LZU for validation, the results were R = 0.715, MAE = 8.257, slope = 0.55, and intercept = 16.30. Using BJ-CAM and LZU as the training sets and AHU for validation, the results were R = 0.767, MAE = 8.785, slope = 0.51, and intercept = 20.33. The relatively higher R value and higher MAE for BJ-CAM may be attributed to two factors: first, the difference in data volume, as BJ-CAM has the largest dataset among the three sites; second, BJ-CAM contains more high PM2.5 values, and MAE reflects absolute errors. Spatial extrapolation leads to a decline in the model’s predictive performance, with predictions biased toward the mean. This results in underestimation of high values and overestimation of low values, causing the slope to decrease. All three validation sets achieved R values above 0.7 and MAEs below 10.5 µg/m3, demonstrating that SAPML has reasonable spatial extrapolation capability.
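The three-site validation scheme can be sketched as a leave-one-site-out splitter. The record contents below are synthetic placeholders, and the restriction to the common observation period is omitted for brevity; only the site names come from the text.

```python
def leave_one_site_out(records_by_site):
    """Yield (held_out_site, training_records, validation_records) per site.

    Each site is held out once; the records of all other sites are pooled
    into the training set.
    """
    for held_out, held_out_records in records_by_site.items():
        training = [
            record
            for site, records in records_by_site.items()
            if site != held_out
            for record in records
        ]
        yield held_out, training, list(held_out_records)


# Synthetic placeholder records of the form (site, index); sizes are arbitrary.
records_by_site = {
    "BJ-CAM": [("BJ-CAM", i) for i in range(5)],
    "LZU": [("LZU", i) for i in range(3)],
    "AHU": [("AHU", i) for i in range(4)],
}

folds = list(leave_one_site_out(records_by_site))
```

Each fold would then be used to fit a model on the training records and to compute R, MAE, slope, and intercept on the held-out site.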

3.4. Feature Optimization

Feature importance analysis was conducted on the XGBoost model, and the results are shown in Figure 7. The O4 dSCD values at high elevation angles (8°, 10°, 15°, 30°) and the solar azimuth angle (SAA) exhibited relatively low importance in the model. This is because scattered light at high angles travels shorter distances near the surface, so the corresponding O4 dSCD data contain less information about near-ground-surface PM2.5 concentrations. The 1° elevation angle corresponds to scattered light penetrating the longest near-ground-surface path, but its signal intensity is very weak. Consequently, the feature importance of the 1° O4 dSCD is lower than those of the 2°, 3°, 4°, 5°, and 6° angles, but higher than those at 8°, 10°, 15°, and 30°. Considering this, feature optimization can be performed by removing SAA and the high-angle (8°, 10°, 15°, 30°) O4 dSCD data. In conventional MAX-DOAS measurements, data from multiple elevation angles are typically used to retrieve the vertical profiles of atmospheric pollutants. However, in this study, the primary objective is to estimate near-surface PM2.5 concentrations. Therefore, high-angle spectral observations can be omitted during actual measurements. This approach reduces the number of observation angles, thereby shortening measurement time and improving temporal resolution, while having only a minor impact on model accuracy. Using O4 dSCD data collected and analyzed from the BJ-CAM, LZU, and AHU MAX-DOAS instruments, merged chronologically and cloud-screened with a TCC threshold of 0.145, the dataset was split into 80% training and 20% testing sets. XGBoost models were trained with and without the optimized feature set. The fitting results for the test set are shown in Figure 6. Before feature removal, the test set results were R = 0.825 and MAE = 10.183, while after removing only the SAA feature, they were R = 0.824 and MAE = 10.254. 
When only the high-angle features were removed, the results were R = 0.816 and MAE = 10.414. After removing both the high-angle and SAA features simultaneously, the test set results were R = 0.814 and MAE = 10.464. Following the removal of high-angle and SAA features, the model’s correlation coefficient (R) decreased by only 1.3%, and the MAE increased by merely 2.5%, while the data acquisition time was reduced by approximately 30%. Therefore, the SAPML method can adopt the optimized feature set by excluding SAA and high-angle O4 dSCD variables.
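The four feature configurations compared above can be sketched as a simple column selector. The column names are hypothetical, and the list contains only the inputs named in this section (the ten elevation angles and SAA); any further model inputs are not represented.

```python
ELEVATION_ANGLES = [1, 2, 3, 4, 5, 6, 8, 10, 15, 30]  # degrees, as listed in the text
HIGH_ANGLES = {8, 10, 15, 30}


def feature_names(drop_high_angles=False, drop_saa=False):
    """Return the feature columns for one of the four configurations."""
    angles = [
        angle
        for angle in ELEVATION_ANGLES
        if not (drop_high_angles and angle in HIGH_ANGLES)
    ]
    names = [f"o4_dscd_{angle}deg" for angle in angles]
    if not drop_saa:
        names.append("saa")
    return names


full_set = feature_names()
optimized_set = feature_names(drop_high_angles=True, drop_saa=True)
```

Under these assumptions the optimized set keeps only the O4 dSCD values at 1° to 6°, which is what allows the high-angle scans to be skipped during measurement.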

3.5. Temporal Variation in Horizontal Distribution of PM2.5 in Hefei

To obtain the horizontal distribution of PM2.5 in Hefei, this study deployed MAX-DOAS instruments at the two sites shown in Figure 1b, namely HFC and HF-HD. The O4 dSCD data at various elevation angles were obtained through spectral analysis of the observed sunlight. To analyze the differences in PM2.5 concentration across different directions, the MAX-DOAS instruments at HFC and HF-HD were set to observe multiple azimuth directions.
The trained model was applied to data from both the HFC and HF-HD sites, producing PM2.5 concentration distributions near HFC from 19 January 2021 to 25 June 2022 and near HF-HD from 28 January 2021 to 14 May 2022. The observation directions at the HFC site were due west and due south. At the HF-HD site, the observation directions were due north and due south before July 2021, and switched to four directions (east, west, south, and north) after September 2021.
Figure 8 shows monthly average line charts for different directions at both sites. Some observation data are missing due to instrument maintenance and other factors. Monthly averages were calculated only when data for more than 10 days in a month were available. At HF-HD, the highest monthly average concentrations for all four directions occurred in January 2022: the north direction peaked at 67.08 µg/m3, east at 67.36 µg/m3, south at 63.01 µg/m3, and west at 67.34 µg/m3. The lowest monthly averages at HF-HD were recorded in July 2021 for the north and south directions, with values of 23.63 µg/m3 and 23.74 µg/m3, respectively. At HFC, the highest monthly averages were observed in January 2022 for the south (59.13 µg/m3) and west (59.62 µg/m3) directions. The lowest monthly averages at HFC were recorded in July 2021 for the south (24.54 µg/m3) and west (25.40 µg/m3) directions.
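The rule that a monthly average is reported only when more than 10 days of data are available can be sketched as follows. Averaging all samples within a month (rather than averaging daily means) is an assumption, and the sample values are synthetic.

```python
from collections import defaultdict
from datetime import date

MIN_DAYS = 10  # a month is reported only if it has more than this many days


def monthly_means(samples, min_days=MIN_DAYS):
    """Return {(year, month): mean PM2.5} for months with enough days of data.

    `samples` is an iterable of (date, pm25) pairs; several samples may share
    the same date.
    """
    values = defaultdict(list)
    days = defaultdict(set)
    for day, pm25 in samples:
        key = (day.year, day.month)
        values[key].append(pm25)
        days[key].add(day)
    return {
        key: sum(month_values) / len(month_values)
        for key, month_values in values.items()
        if len(days[key]) > min_days
    }


# Synthetic example: 11 days of data in January, only 10 days in February.
samples = [(date(2022, 1, d), 60.0) for d in range(1, 12)]
samples += [(date(2022, 2, d), 40.0) for d in range(1, 11)]

means = monthly_means(samples)
```

In this example only January is reported, because February has exactly 10 days of data and therefore fails the "more than 10 days" condition.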

3.6. Spatial Distribution of PM2.5 in Hefei

As shown in Figure 9, the proportions of different PM2.5 concentration levels were statistically analyzed for each direction at the HF-HD and HFC sites. The statistics for HFC cover the entire observation period, while those for HF-HD start from the time when four observation directions were available. The results are presented in a rose chart format combined with a map. The chart includes the four cardinal directions—east, south, west, and north—with the proportion of PM2.5 concentration in each direction shown. Different concentration ranges are represented by distinct colors, and the width of each segment corresponds to the percentage of data points within that concentration range relative to the total data volume.
At HF-HD, the proportion of PM2.5 concentrations exceeding 80.00 µg/m3 was 5.80% for the north direction, 4.02% for the east, 2.74% for the south, and 5.51% for the west. At HFC, the proportions exceeding 80.00 µg/m3 were 2.89% for the south direction and 2.92% for the west.
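The per-direction statistics behind the rose charts can be sketched as a binning of the estimated concentrations. Only the 20 and 80 µg/m3 boundaries appear in the text; the intermediate edges at 40 and 60 µg/m3 are an assumption, as are the sample values.

```python
from bisect import bisect_left

# Bin edges in µg/m3. The 20 and 80 boundaries come from the text; 40 and 60
# are assumed for illustration.
BIN_EDGES = [20.0, 40.0, 60.0, 80.0]


def bin_proportions(values, edges=BIN_EDGES):
    """Return the percentage of values falling in each concentration bin.

    Bins are (-inf, 20], (20, 40], (40, 60], (60, 80], (80, inf), so the last
    entry is the share of values exceeding 80 µg/m3.
    """
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[bisect_left(edges, value)] += 1
    total = len(values)
    return [100.0 * count / total for count in counts]


# Synthetic concentrations for one observation direction.
values = [10.0, 25.0, 45.0, 65.0, 85.0, 90.0, 95.0, 15.0, 30.0, 50.0]
proportions = bin_proportions(values)
```

Running the same function separately for each direction and site gives the segment widths of one rose chart; restricting the input to a season gives the seasonal version.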
The north direction at HF-HD points toward the suburban area of Hefei, which has a dense industrial park, explaining why the proportion of PM2.5 concentrations above 80.00 µg/m3 is higher in this direction than in the other three. The notably lower proportion of high PM2.5 concentrations (>80.00 µg/m3) in the south direction of HF-HD may be related to geographical factors. There is a large ecological park with dense vegetation to the south of HF-HD, and the river flows along this direction, facilitating the deposition of PM2.5 into the water [56]. At HFC, the geographic environments of the west and south directions are similar, which explains why the proportions of high PM2.5 concentrations in these two directions are close.
As shown in Figure 10, the PM2.5 concentrations retrieved at the HF-HD and HFC sites were categorized by season, and the proportions of different PM2.5 concentration levels for each direction were statistically analyzed. The visualization follows the rose chart style used in Figure 9.
During winter 2021, PM2.5 concentrations reached their highest levels at HF-HD, with the proportions of concentrations above 80.00 µg/m3 being 13.09% in the north, 9.07% in the east, 6.27% in the south, and 12.44% in the west. At HFC, winter 2021 also saw the highest PM2.5 concentrations, with 5.34% and 5.26% exceeding 80.00 µg/m3 in the south and west directions, respectively. In contrast, summer 2021 showed the lowest PM2.5 concentrations at HF-HD, with no concentrations exceeding 80.00 µg/m3 in the north or south directions. The proportion of concentrations below 20.00 µg/m3 was 28.29% in the north and 27.28% in the south. Similarly, at HFC in summer 2021, no PM2.5 concentrations exceeded 80.00 µg/m3 in the north or south directions, with proportions of low concentrations (<20.00 µg/m3) of 29.21% in the north and 20.52% in the south. Based on the analysis of Figure 9, the proportion of high PM2.5 concentrations in the south direction at HF-HD during the observation period was lower than in other directions, which may have been caused by the winter conditions of 2021.
The higher proportions of elevated PM2.5 concentrations at HF-HD during winter 2021 compared to HFC may be attributed to its location in a commercial center with higher pedestrian and vehicular traffic, whereas HFC is located on a college campus with reduced traffic during winter break. These findings highlight the spatial heterogeneity of PM2.5 concentrations within Hefei city, showing not only differences between regions during the same season but also directional variability within the same site. Such fine-scale non-uniform distribution cannot be captured by CNEMC’s PM2.5 monitoring data alone but can be effectively resolved using the SAPML method.

3.7. Transport of PM2.5 During a Pollution Episode in Hefei

On 23 December 2021, Hefei experienced a pollution episode under clear weather conditions with a level 3 easterly wind. Using the trained model along with the O4 dSCD data observed and analyzed at HF-HD and HFC on that day, the spatial distribution of PM2.5 concentrations around these sites was obtained.
Figure 11a illustrates the temporal variations in PM2.5 concentrations along the east and west directions at HF-HD and the west direction at HFC on 23 December. Between 09:00 and 10:00, an increase in PM2.5 concentration was first observed east of HF-HD. Subsequently, PM2.5 levels west of HF-HD and west of HFC also began to rise. Before noon, the PM2.5 concentration east of HF-HD exceeded 80 µg/m3, and concentrations above 80 µg/m3 were also observed west of HF-HD after noon. Around 13:00, the PM2.5 concentration west of HFC approached 80 µg/m3. Figure 11b presents the spatial distribution of PM2.5 concentrations from 10:00 to 13:00 on the same day. Between 10:00 and 11:00, high PM2.5 concentrations were first detected on the eastern side of HF-HD, followed by a gradual increase on the western side of HF-HD and west of HFC. By 12:00, the PM2.5 concentrations on both the east and west sides of HF-HD reached approximately 82 µg/m3, while those west of HFC rose to about 70 µg/m3. By 13:00, the PM2.5 concentration west of HFC had further increased to 77 µg/m3. At 14:00, the concentration east of HF-HD began to decrease, whereas the areas west of HF-HD and west of HFC still exhibited high concentrations.

4. Discussion

In this section, the performance of the four machine learning models in the temporal extrapolation validation is analyzed, and the XGBoost model is compared with the traditional approach of converting extinction coefficients into PM2.5 concentrations by linear fitting. The temporal evolution of the horizontal distribution of PM2.5 in Hefei, including the heterogeneous distribution of urban pollution and its causes, is then discussed. Finally, a pollution episode in Hefei on 23 December 2021 is analyzed, capturing the PM2.5 pollution transport process within the urban area.
In Section 3.1, as shown in Figure 4, among the four machine learning models in temporal extrapolation validation, XGBoost performs the best, achieving a correlation coefficient R of 0.825. Meanwhile, LightGBM, Random Forest, and the BP neural network perform less well; the BP neural network, in particular, exhibits the lowest R on the test set, with a lower slope and a higher intercept, which may be attributed to overfitting caused by the limited size of the training dataset. Wang et al. conducted MAX-DOAS observations in the urban environment of Madrid, Spain. Because the concentration of O4 in the atmosphere is approximately constant, changes in the optical path caused by atmospheric scattering result in corresponding variations in O4 dSCD. Based on this principle, aerosol extinction coefficients can be retrieved using multi-elevation O4 dSCD measurements. Wang et al. employed O4 dSCD in the ultraviolet (UV) spectral region and retrieved aerosol extinction coefficient profiles via an optimization-based inversion algorithm. Subsequently, a linear fit between the aerosol extinction coefficients and PM2.5 mass concentrations yielded an hourly mean correlation coefficient R of 0.64 [57]. Similar work was conducted by Lee et al. in Fresno, USA, during 12 haze events. From MAX-DOAS observations, aerosol extinction coefficient vertical profiles and aerosol optical depths at two UV wavelengths, 353 nm and 380 nm, were retrieved. The correlation coefficients between aerosol extinction coefficients and near-surface PM2.5 concentrations were R = 0.73 and R = 0.74 at 380 nm and 353 nm, respectively [58]. In the present study, sequences of O4 dSCD retrieved from spectroscopic observations were directly used as input features. Multi-elevation O4 dSCD measurements from the AHU, BJ-CAM, and LZU sites served as predictors, while PM2.5 mass concentrations at the corresponding CNEMC sites along the observation azimuths were used as labels. 
An XGBoost model was trained to learn the nonlinear relationships between O4 dSCD sequences at different elevation angles and PM2.5 concentrations. The trained model is capable of directly estimating PM2.5 concentrations from O4 dSCD sequences. This approach avoids the cumulative errors associated with first retrieving aerosol extinction coefficients from O4 dSCD and then estimating PM2.5 concentrations through their moderate linear correlation [59]. As a result, the XGBoost model achieved a correlation coefficient of 0.825.
In Section 3.5, as shown in Figure 8, the monthly average concentration trends at both sites showed similar seasonal patterns. During the observation period, all observed directions exhibited higher PM2.5 concentrations in winter and lower concentrations in summer. The highest monthly averages appeared in January 2022 and the lowest in July 2021. At the HFC site, no significant differences between the two observation directions were observed across months. However, at the HF-HD site, in the month with the highest average concentration (January 2022), the southward concentration was noticeably lower than in other directions. This phenomenon may be related to both the geographic environment surrounding the HF-HD site and the prevailing wind conditions during that month. The geographic environment around HF-HD is shown in Figure A1a of Appendix B, and the wind rose for HF-HD in January 2022 is presented in Figure A1b of Appendix B; wind direction and speed data were obtained from ERA5. An industrial park is located in the northern part of Hefei, and northerly winds predominated in January. Consequently, pollutants emitted from the northern industrial area were transported southward toward the HF-HD site. The northern, eastern, and western surroundings of HF-HD are primarily commercial and residential areas, while a large urban park with substantial green coverage lies to its south. In addition, the southern side of HF-HD follows the flow direction of the Fei River, where PM2.5 particles transported from the north tend to deposit. As a result, the monthly average PM2.5 concentration in the southern direction of HF-HD is typically lower than in other directions.
Through multi-directional observations, the SAPML method can detect the heterogeneous distribution of PM2.5 concentrations within urban areas. The advantage of this approach lies in its ability to resolve PM2.5 concentration differences across different directions at the same site. However, a potential limitation is that increasing the number of observation directions may reduce the temporal resolution for each direction.
In Section 3.7, as shown in Figure 11, the PM2.5 pollution transport event that occurred in Hefei on 23 December 2021 is analyzed based on observations from the HF-HD and HFC sites, combined with ERA5 wind speed and wind direction data. Pollution transport first appeared at 10:00, when elevated PM2.5 concentrations were detected on the east side of HF-HD, accompanied by a wind speed of 2.9 m/s. At 11:00, increased PM2.5 concentrations were detected on the west side of HF-HD and the west side of HFC, while concentrations on the east side of HF-HD had already risen above 70 µg/m3; the wind speed at this time was 2.7 m/s. By 12:00, PM2.5 concentrations on both the east and west sides of HF-HD exceeded 80 µg/m3, and the west side of HFC exceeded 70 µg/m3, with a wind speed of 2.5 m/s. At 13:00, the PM2.5 concentration on the east side of HF-HD decreased to below 80 µg/m3 but remained above 70 µg/m3, whereas no decreasing trend was observed on the west side of HF-HD or the west side of HFC; the wind speed at this time was 3.0 m/s. At 14:00, the PM2.5 concentration on the east side of HF-HD dropped below 60 µg/m3, while the west sides of both HF-HD and HFC began to show a decreasing trend; the wind speed was 3.1 m/s. By analyzing the sequence in which peak concentrations occurred at the three directions, it can be inferred that the pollution was transported from east to west, consistent with the easterly winds observed on that day. During the transport period, wind speeds ranged from approximately 2.5 to 3.1 m/s, corresponding to a one-hour transport distance of about 9–11 km. The pollution plume reached the east side of HF-HD at 10:00 and arrived at the west side of HF-HD and the west side of HFC approximately one hour later. Between 12:00 and 14:00, all three directions remained in a high PM2.5 pollution state, followed by a decline after 14:00. 
The above analysis demonstrates that the two MAX-DOAS instruments jointly captured a PM2.5 pollution plume, which was transported across Hefei from the east side of HF-HD to the west side of HFC between 10:00 and 14:00 on 23 December 2021. This finding indicates that the proposed method has the potential to monitor PM2.5 emission dispersion and its transport processes within urban environments. During this event, the daily peak concentrations across all three monitored directions occurred around midday, which differs from typical local PM2.5 pollution patterns. By combining multi-azimuth observations from two independent MAX-DOAS instruments, the SAPML method is able to capture the intra-urban PM2.5 transport process. This approach provides valuable insights for studying pollution transport within urban environments.
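The one-hour transport distance of about 9–11 km quoted for this episode follows directly from the reported wind speeds. The short sketch below redoes that arithmetic; the function name is illustrative, and the wind speeds are the hourly values given in the discussion.

```python
def hourly_transport_km(wind_speed_m_s, hours=1.0):
    """Distance in km covered by an air parcel moving at a constant wind speed."""
    return wind_speed_m_s * 3600.0 * hours / 1000.0


# Hourly wind speeds (m/s) reported for 23 December 2021.
wind_speeds = {"10:00": 2.9, "11:00": 2.7, "12:00": 2.5, "13:00": 3.0, "14:00": 3.1}

distances = {time: hourly_transport_km(speed) for time, speed in wind_speeds.items()}
shortest = min(distances.values())  # 2.5 m/s for one hour
longest = max(distances.values())   # 3.1 m/s for one hour
```

The slowest and fastest hours give roughly 9.0 km and 11.2 km, respectively, matching the range stated in the text. This treats the wind as constant over each hour, which is a simplification.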
The SAPML method, when combined with multi-azimuth MAX-DOAS observations, enables rapid estimation of PM2.5 concentrations using only the instrument itself, without requiring high-performance computing resources. Each estimation takes less than one minute. By bypassing the aerosol extinction retrieval step, which relies on computationally intensive optimization algorithms, SAPML avoids the accumulation of errors inherent in conventional PM2.5 estimation approaches, resulting in a stronger correlation with in situ measurements. In addition, coordinated multi-directional observations from multiple instruments allow for long-term assessment of intra-urban pollutant heterogeneity and the detection of short-term pollution transport processes.

5. Conclusions

This study proposes a new framework for estimating PM2.5 concentrations that integrates spectral analysis with artificial intelligence, termed SAPML. Unlike the conventional MAX-DOAS aerosol inversion approach, SAPML directly estimates PM2.5 concentrations from multi-elevation O4 dSCD measurements, thereby bypassing the aerosol extinction retrieval step and avoiding the cumulative errors associated with aerosol extinction inversion and subsequent PM2.5 estimation. In practical applications, SAPML does not rely on meteorological data and enables direct PM2.5 monitoring using observations from a single instrument. When multi-azimuth MAX-DOAS observations are available, SAPML can also resolve the spatial distribution of PM2.5 within different areas of a city. Furthermore, by combining observations from two MAX-DOAS instruments, the method can be used to analyze pollution transport processes during pollution events.
Cloud screening of the O4 dSCD data was performed to ensure data quality, followed by model selection, validation, and optimization. The trained model was then applied to two MAX-DOAS sites in Hefei to reveal the spatio-temporal characteristics of the PM2.5 distribution. Because raw O4 dSCD data are often contaminated by clouds, cloud screening was conducted using ERA5 TCC data. The screening threshold was determined by calculating the screening rate and identifying the intersection point of the fitted curves, which gave a TCC value of 0.145. With this threshold, the screening performance closely matched that of AOD-based screening.
To select the most suitable machine learning algorithm for SAPML, O4 dSCD observations from the BJ-CAM, LZU, and AHU stations were merged in chronological order and split into training and test sets. The Random Forest, XGBoost, BP neural network, and LightGBM models were compared. XGBoost achieved the highest test set correlation (R = 0.825) and was chosen as the SAPML algorithm. Spatial extrapolation capability was evaluated via leave-one-site-out cross-validation using data from the three stations. The mean correlation coefficient reached 0.782, indicating that the model possesses a certain degree of spatial generalization capability. Feature vector optimization was performed by removing high-elevation O4 dSCD and SAA. The performance decrease after optimization was minimal (R decreased by approximately 1.3%), while the data acquisition time was reduced by about 30%.
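For reference, the mean correlation coefficient of 0.782 is consistent with the unweighted average of the three leave-one-site-out R values reported in Section 3.3 (0.865 for BJ-CAM, 0.715 for LZU, and 0.767 for AHU). Unweighted averaging is an assumption here, since the text does not state how the mean was formed:

```latex
\bar{R} = \frac{0.865 + 0.715 + 0.767}{3} = \frac{2.347}{3} \approx 0.782
```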
The trained model was applied to the HF-HD and HFC sites. The results indicated higher PM2.5 concentrations in winter and lower concentrations in summer across all directions. The highest monthly mean concentrations in each direction occurred in January 2022, while the lowest occurred in July 2021. The south and west directions at HFC showed negligible differences, whereas the south direction at HF-HD generally exhibited lower PM2.5 concentrations than the other directions, likely because the south side of HF-HD is an ecological park, while the remaining directions are surrounded by industrial and residential areas. A pollution transport event from east to west on 23 December 2021 showed that the increase in PM2.5 concentrations was first detected east of HF-HD, followed by west of HF-HD, and finally west of HFC. This demonstrates that the SAPML method can reveal intra-urban pollution transport processes.
This study proposes a novel framework named SAPML for estimating PM2.5 mass concentrations. Based on spectral analysis and artificial intelligence, SAPML can bypass extinction coefficients and directly retrieve PM2.5 mass concentrations. The method is applicable to multiple cities in China. The SAPML method is suitable for analyzing the heterogeneous distribution of PM2.5 within urban areas, as it can resolve spatial differences among different sites in the same city, as well as directional variations at individual sites. The method is potentially valuable for investigating the transport, transformation, deposition, and formation of PM2.5 within urban areas. In the future, we plan to combine the real-time PM2.5 mass concentrations detected using SAPML with artificial intelligence techniques to reasonably predict future PM2.5 levels.

Author Contributions

Conceptualization, Q.L. and H.Q.; Methodology, H.Q.; Software, Z.Z. and T.G.; Investigation, S.X.; Resources, W.T.; Data Curation, H.Q., Q.L., S.X. and Z.Z.; Writing—Original Draft, H.Q.; Supervision, Q.H.; Project Administration, Q.H.; Funding Acquisition, Q.L. and Q.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (2022YFC3700100), the Key Research and Development Project of Anhui Province (2023t07020015), the Key Research Program of Frontier Sciences, CAS (No. ZDBS-LY-DQC008), the Youth Innovation Promotion Association of CAS (2021443), the HFIPS Director’s Fund (BJPY2022B07 and YZJJQY202303), the Hefei Comprehensive National Science Center, and the major science and technology special project of the Xinjiang Uygur Autonomous Region (2024A03012).

Data Availability Statement

The datasets generated and/or analyzed in this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to express their gratitude to the China National Environmental Monitoring Center (CNEMC) for providing free hourly ground-level PM2.5 concentration data from its monitoring stations (http://www.cnemc.cn/, last accessed on 5 September 2025). Special thanks are extended to the UV-Visible DOAS team of the BIRA-IASB, led by M. Van Roozendael, for their work on the multi-axis differential optical absorption spectroscopy (DOAS) instrument. This study employed spectral fitting based on the QDOAS (version 3.6.5) software (a free and open-source program developed by the authors, available at https://uv-vis.aeronomie.be/software/QDOAS/, last accessed on 5 September 2025). The authors also acknowledge the European Centre for Medium-Range Weather Forecasts (ECMWF) and the Copernicus Climate Change Service (C3S) for providing the ERA5 reanalysis dataset. The AERONET data were provided by the National Aeronautics and Space Administration (NASA) and the Centre National d’Études Spatiales (CNES). Finally, the authors thank all the researchers whose work is cited in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

The detailed parameter configuration of the QDOAS software used in this study is shown in Table A1.
Table A1. Parameter configuration of the QDOAS software.
| Wavelength Range | Parameters | Data Sources |
|---|---|---|
| 338–370 nm | NO2 | 298 K, I0 correction (SCD of 10^17 molecules cm^−2) |
| | O3 | 223 K, I0 correction (SCD of 10^20 molecules cm^−2) |
| | O3 | 293 K, I0 correction (SCD of 10^20 molecules cm^−2) |
| | O4 | 293 K |
| | HCHO | 297 K |
| | SO2 | 294 K |
| | BrO | 223 K |
| | H2O | 296 K, HITEMP |

Appendix B

Figure A1. (a) The map of HF-HD (b) The wind rose of HF-HD in January 2022.

References

  1. Huang, R.-J.; Zhang, Y.; Bozzetti, C.; Ho, K.-F.; Cao, J.-J.; Han, Y.; Daellenbach, K.R.; Slowik, J.G.; Platt, S.M.; Canonaco, F.; et al. High secondary aerosol contribution to particulate pollution during haze events in China. Nature 2014, 514, 218–222.
  2. Feng, S.; Gao, D.; Liao, F.; Zhou, F.; Wang, X. The health effects of ambient PM2.5 and potential mechanisms. Ecotoxicol. Environ. Saf. 2016, 128, 67–74.
  3. World Health Organization. WHO Global Air Quality Guidelines: Particulate Matter (PM2.5 and PM10), Ozone, Nitrogen Dioxide, Sulfur Dioxide, and Carbon Monoxide; World Health Organization: Geneva, Switzerland, 2021; Available online: https://www.who.int/publications/i/item/9789240034228 (accessed on 10 July 2025).
  4. Cohen, A.J.; Brauer, M.; Burnett, R.; Anderson, H.R.; Frostad, J.; Estep, K.; Balakrishnan, K.; Brunekreef, B.; Dandona, L.; Dandona, R.; et al. Estimates and 25-year trends of the global burden of disease attributable to ambient air pollution: An analysis of data from the Global Burden of Diseases Study 2015. Lancet 2017, 389, 1907–1918.
  5. Pope, C.A., III; Burnett, R.T.; Thun, M.J.; Calle, E.E.; Krewski, D.; Ito, K.; Thurston, G.D. Lung cancer, cardiopulmonary mortality, and long-term exposure to fine particulate air pollution. JAMA 2002, 287, 1132–1141.
  6. Huang, C.; Chen, B.-Y.; Pan, S.; Ho, Y.; Guo, Y. Prenatal exposure to PM2.5 and congenital heart diseases in Taiwan. Sci. Total. Environ. 2019, 655, 880–886.
  7. Health Effects Institute (HEI). State of Global Air 2022. Special Report: The Health Impacts of Air Pollution. 2022. Available online: https://www.stateofglobalair.org/ (accessed on 11 July 2025).
  8. Balakrishnan, K.; Dey, S.; Gupta, T.; Dhaliwal, R.S.; Brauer, M.; Cohen, A.J.; Stanaway, J.D.; Beig, G.; Joshi, T.K.; Aggarwal, A.N. The impact of air pollution on deaths, disease burden, and life expectancy across the states of India: The global burden of disease study 2017. Lancet Planet. Health 2019, 3, e26–e39.
  9. Chow, J.C.; Watson, J.G. New directions: Beyond compliance air quality measurements. Atmos. Environ. 2008, 42, 5166–5168.
  10. U.S. Environmental Protection Agency. Air Quality System (AQS) and Reference/Equivalent Methods for PM2.5 Measurement; U.S. Environmental Protection Agency: Washington, DC, USA, 2016. Available online: https://www.epa.gov/ (accessed on 11 July 2025).
  11. Schwab, J.J.; Felton, H.D.; Rattigan, O.V.; Demerjian, K.L. New York state urban and rural measurements of continuous PM2.5 mass by FDMS, TEOM, and BAM. J. Air Waste Manag. Assoc. 2006, 56, 372–383. [Google Scholar] [CrossRef]
  12. Le, T.-C.; Shukla, K.K.; Chen, Y.-T.; Chang, S.-C.; Lin, T.-Y.; Li, Z.; Pui, D.Y.; Tsai, C.-J. On the concentration differences between PM2.5 FEM monitors and FRM samplers. Atmos. Environ. 2020, 222, 117138. [Google Scholar] [CrossRef]
  13. Chung, A.; Chang, D.P.Y.; Kleeman, M.J.; Perry, K.D.; Cahill, T.A.; Dutcher, D.; McDougall, E.M.; Stroud, K. Comparison of real-time instruments used to monitor airborne particulate matter. J. Air Waste Manag. Assoc. 2001, 51, 109–120. [Google Scholar] [CrossRef]
  14. Wu, D.; Zhang, G.; Liu, J.; Shen, S.; Yang, Z.; Pan, Y.; Zhao, X.; Yang, S.; Tian, Y.; Zhao, H.; et al. Influence of particle properties and environmental factors on the performance of typical particle monitors and low-cost particle sensors in the market of China. Atmos. Environ. 2022, 268, 118825. [Google Scholar] [CrossRef]
  15. China National Environmental Monitoring Center. National Air Quality Monitoring Network Data Technical Specification (HJ 663–2021); Ministry of Ecology and Environment of the People’s Republic of China: Beijing, China, 2021; Available online: http://www.cnemc.cn/ (accessed on 11 July 2025).
  16. Wang, Y.; Ying, Q.; Hu, J.; Zhang, H. Spatial and temporal variations of six criteria air pollutants in 31 provincial capital cities in China during 2013–2014. Environ. Int. 2014, 73, 413–422. [Google Scholar] [CrossRef]
  17. Birmili, W.; Wiedensohler, A.; Heintzenberg, J.; Lehmann, K. Atmospheric particle number size distribution in central Europe: Statistical relations to air masses and meteorology. J. Geophys. Res. Atmos. 2001, 106, 32005–32018. [Google Scholar] [CrossRef]
  18. Hansen, A.D.A.; Rosen, H.; Novakov, T. The aethalometer—An instrument for the real-time measurement of optical absorption by aerosol particles. Sci. Total Environ. 1984, 36, 191–196. [Google Scholar] [CrossRef]
  19. Backman, J.; Schmeisser, L.; Virkkula, A.; Ogren, J.A.; Asmi, E.; Starkweather, S.; Sharma, S.; Eleftheriadis, K.; Uttal, T.; Jefferson, A.; et al. On Aethalometer measurement uncertainties and an instrument correction factor for the Arctic. Atmos. Meas. Tech. 2017, 10, 5039–5062. [Google Scholar] [CrossRef]
  20. Apte, J.S.; Messier, K.P.; Gani, S.; Brauer, M.; Kirchstetter, T.W.; Lunden, M.M.; Marshall, J.D.; Portier, C.J.; Vermeulen, R.C.; Hamburg, S.P. High-resolution air pollution mapping with Google street view cars: Exploiting big data. Environ. Sci. Technol. 2017, 51, 6999–7008. [Google Scholar] [CrossRef] [PubMed]
  21. Yu, Y.T.; Xiang, S.; Li, R.; Zhang, S.; Zhang, K.M.; Si, S.; Wu, X.; Wu, Y. Characterizing spatial variations of city-wide elevated PM10 and PM2.5 concentrations using taxi-based mobile monitoring. Sci. Total Environ. 2022, 829, 154478. [Google Scholar] [CrossRef]
  22. Snyder, E.G.; Watkins, T.H.; Solomon, P.A.; Thoma, E.D.; Williams, R.W.; Hagler, G.S.W.; Shelow, D.; Hindin, D.A.; Kilaru, V.J.; Preuss, P.W. The changing paradigm of air pollution monitoring. Environ. Sci. Technol. 2013, 47, 11369–11377. [Google Scholar] [CrossRef]
  23. Kumar, P.; Morawska, L.; Martani, C.; Biskos, G.; Neophytou, M.; Di Sabatino, S.; Bell, M.; Norford, L.; Britter, R. The rise of low-cost sensing for managing air pollution in cities. Environ. Int. 2015, 75, 199–205. [Google Scholar] [CrossRef]
  24. King, M.D.; Kaufman, Y.J.; Tanré, D.; Nakajima, T. Remote sensing of tropospheric aerosols from space: Past, present, and future. Bull. Am. Meteorol. Soc. 1999, 80, 2229–2260. [Google Scholar] [CrossRef]
  25. Levy, R.C.; Mattoo, S.; Munchak, L.A.; Remer, L.A.; Sayer, A.M.; Patadia, F.; Hsu, N.C. The Collection 6 MODIS aerosol products over land and ocean. Atmos. Meas. Tech. 2013, 6, 2989–3034. [Google Scholar] [CrossRef]
  26. Remer, L.A.; Kaufman, Y.J.; Tanré, D.; Mattoo, S.; Chu, D.A.; Martins, J.V.; Li, R.R.; Ichoku, C.; Levy, R.C.; Kleidman, R.G.; et al. The MODIS aerosol algorithm, products, and validation. J. Atmos. Sci. 2005, 62, 947–973. [Google Scholar] [CrossRef]
  27. Winker, D.M.; Vaughan, M.A.; Omar, A.; Hu, Y.; Powell, K.A.; Liu, Z.; Hunt, W.H.; Young, S.A. Overview of the CALIPSO mission and CALIOP data processing algorithms. J. Atmos. Ocean. Technol. 2009, 26, 2310–2323. [Google Scholar] [CrossRef]
  28. Bessho, K.; Date, K.; Hayashi, M.; Ikeda, A.; Imai, T.; Inoue, H.; Kumagai, Y.; Miyakawa, T.; Murata, H.; Ohno, T.; et al. An introduction to Himawari-8/9—Japan’s new-generation geostationary meteorological satellites. J. Meteorol. Soc. Jpn. Ser. II 2016, 94, 151–183. [Google Scholar] [CrossRef]
  29. Holben, B.N.; Eck, T.F.; Slutsker, I.; Tanré, D.; Buis, J.P.; Setzer, A.; Vermote, E.; Reagan, J.A.; Kaufman, Y.J.; Nakajima, T.; et al. AERONET—A federated instrument network and data archive for aerosol characterization. Remote Sens. Environ. 1998, 66, 1–16. [Google Scholar] [CrossRef]
  30. Ansmann, A.; Wandinger, U.; Riebesell, M.; Weitkamp, C.; Michaelis, W. Independent measurement of extinction and backscatter profiles in cirrus clouds by using a combined Raman elastic-backscatter lidar. Appl. Opt. 1992, 31, 7113–7131. [Google Scholar] [CrossRef] [PubMed]
  31. Hönninger, G.; von Friedeburg, C.; Platt, U. Multi axis differential optical absorption spectroscopy (MAX-DOAS). Atmos. Chem. Phys. 2004, 4, 231–254. [Google Scholar] [CrossRef]
  32. Wagner, T.; Dix, B.; von Friedeburg, C.; Frieß, U.; Sanghavi, S.; Sinreich, R.; Platt, U. MAX-DOAS O4 measurements: A new technique to derive information on atmospheric aerosols—Principles and information content. J. Geophys. Res. Atmos. 2004, 109, D22205. [Google Scholar] [CrossRef]
  33. Irie, H.; Kanaya, Y.; Akimoto, H.; Iwabuchi, H.; Shimizu, A.; Aoki, K. First retrieval of tropospheric aerosol profiles using MAX-DOAS and comparison with lidar and sky radiometer measurements. Atmos. Chem. Phys. 2008, 8, 341–350. [Google Scholar] [CrossRef]
  34. Irie, H.; Takashima, H.; Kanaya, Y.; Boersma, K.F.; Gast, L.; Wittrock, F.; Brunner, D.; Zhou, Y.; Van Roozendael, M. Eight-component retrievals from ground-based MAX-DOAS observations. Atmos. Meas. Tech. 2011, 4, 1027–1044. [Google Scholar] [CrossRef]
  35. Frieß, U.; Monks, P.S.; Remedios, J.J.; Rozanov, A.; Sinreich, R.; Wagner, T.; Platt, U. MAX-DOAS O4 measurements: A new technique to derive information on atmospheric aerosols: 2. Modeling studies. J. Geophys. Res. Atmos. 2006, 111, D14203. [Google Scholar] [CrossRef]
  36. Xin, J.; Gong, C.; Liu, Z.; Cong, Z.; Gao, W.; Song, T.; Pan, Y.; Sun, Y.; Ji, D.; Wang, L.; et al. The observation-based relationships between PM2.5 and AOD over China. J. Geophys. Res. Atmos. 2016, 121, 10701–10716. [Google Scholar] [CrossRef]
  37. Van Donkelaar, A.; Martin, R.V.; Brauer, M.; Hsu, N.C.; Kahn, R.A.; Levy, R.C.; Lyapustin, A.; Sayer, A.M.; Winker, D.M. Documentation for the global annual PM2.5 grids from MODIS, MISR and SeaWiFS aerosol optical depth (AOD) with GWR, 1998–2016. In NASA Socioeconomic Data and Applications Center (SEDAC) Data Set; Center for International Earth Science Information Network (CIESIN): New York, NY, USA, 2018; H4B27S72. [Google Scholar] [CrossRef]
  38. Van Donkelaar, A.; Martin, R.V.; Brauer, M.; Hsu, N.C.; Kahn, R.A.; Levy, R.C.; Lyapustin, A.; Sayer, A.M.; Winker, D.M. Global estimates of fine particulate matter using a combined geophysical-statistical method with information from satellites, models, and monitors. Environ. Sci. Technol. 2016, 50, 3762–3772. [Google Scholar] [CrossRef] [PubMed]
  39. Zhang, S.; Wang, S.; Zhu, J.; Xue, R.; Jiang, Z.; Gu, C.; Yan, Y.; Zhou, B. Stacking machine learning models empowered high time-height-resolved ozone profiling from the ground to the stratopause based on MAX-DOAS observation. Environ. Sci. Technol. 2024, 58, 7433–7444. [Google Scholar] [CrossRef] [PubMed]
  40. Pan, Y.; Tian, X.; Xie, P.; Li, A.; Xu, J.; Ren, B.; Huang, X.; Tian, W.; Wang, Z. Prediction of Tropospheric NO2 Profile Using CNN-SVR-Based MAX-DOAS. Acta Opt. Sin. 2022, 42, 2401001. [Google Scholar] [CrossRef]
  41. Chen, G.; Li, S.; Knibbs, L.D.; Hamm, N.A.S.; Cao, W.; Li, T.; Guo, J.; Ren, H.; Abramson, M.J.; Guo, Y. A machine learning method to estimate PM2.5 concentrations across China with remote sensing, meteorological and land use information. Sci. Total Environ. 2018, 636, 52–60. [Google Scholar] [CrossRef]
  42. Peng, J.; Han, H.; Yi, Y.; Huang, H.; Xie, L. Machine learning and deep learning modeling and simulation for predicting PM2.5 concentrations. Chemosphere 2022, 308, 136353. [Google Scholar] [CrossRef]
  43. Xiao, Q.; Chang, H.H.; Geng, G.; Liu, Y. An ensemble machine-learning model to predict historical PM2.5 concentrations in China from satellite data. Environ. Sci. Technol. 2018, 52, 13260–13269. [Google Scholar] [CrossRef]
  44. Li, T.; Shen, H.; Yuan, Q.; Zhang, L. A locally weighted neural network constrained by global training for remote sensing estimation of PM2.5. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4102513. [Google Scholar] [CrossRef]
  45. Karimian, H.; Li, Q.; Wu, C.; Qi, Y.; Mo, Y.; Chen, G.; Zhang, X.; Sachdeva, S. Evaluation of different machine learning approaches to forecasting PM2.5 mass concentrations. Aerosol Air Qual. Res. 2019, 19, 1400–1410. [Google Scholar] [CrossRef]
  46. Di, Q.; Amini, H.; Shi, L.; Kloog, I.; Silvern, R.; Kelly, J.; Sabath, M.B.; Choirat, C.; Koutrakis, P.; Lyapustin, A.; et al. An ensemble-based model of PM2.5 concentration across the contiguous United States with high spatiotemporal resolution. Environ. Int. 2019, 130, 104909. [Google Scholar] [CrossRef] [PubMed]
  47. Xing, C.; Liu, C.; Wang, S.; Hu, Q.; Liu, H.; Tan, W.; Zhang, W.; Li, B.; Liu, J. A new method to determine the aerosol optical properties from multiple-wavelength O4 absorptions by MAX-DOAS observation. Atmos. Meas. Tech. 2019, 12, 3289–3302. [Google Scholar] [CrossRef]
  48. Van Roozendael, M.; Hendrick, F.; Friedrich, M.M.; Fayt, C.; Bais, A.; Beirle, S.; Bösch, T.; Comas, M.N.; Friess, U.; Karagkiozidis, D.; et al. Fiducial reference measurements for air quality monitoring using ground-based MAX-DOAS instruments (FRM4DOAS). Remote Sens. 2024, 16, 4523. [Google Scholar] [CrossRef]
  49. Wu, H.; Xu, X.; Luo, T.; Yang, Y.; Xiong, Z.; Wang, Y. Variation and comparison of cloud cover in MODIS and four reanalysis datasets of ERA-interim, ERA5, MERRA-2 and NCEP. Atmos. Res. 2023, 281, 106477. [Google Scholar] [CrossRef]
  50. Liu, C.; Xing, C.; Hu, Q.; Li, Q.; Liu, H.; Hong, Q.; Tan, W.; Ji, X.; Lin, H.; Lu, C.; et al. Ground-based hyperspectral stereoscopic remote sensing network: A promising strategy to learn coordinated control of O3 and PM2.5 over China. Engineering 2022, 19, 71–83. [Google Scholar] [CrossRef]
  51. Rodgers, J.L.; Nicewander, W.A. Thirteen ways to look at the correlation coefficient. Am. Stat. 1988, 42, 59–66. [Google Scholar] [CrossRef]
  52. Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82. [Google Scholar] [CrossRef]
  53. Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967. [Google Scholar] [CrossRef]
  54. Takkala, H.R.; Khanduri, V.; Singh, A.; Somepalli, S.N.; Maddieni, R.; Patra, S. Kyphosis disease prediction with help of randomizedsearchcv and adaboosting. In Proceedings of the 13th International Conference on Computing Communication and Networking Technologies (ICCCNT), Bombay, India, 3–5 October 2022; IEEE: New York, NY, USA, 2022; pp. 1–5. [Google Scholar]
  55. Belitz, K.; Stackelberg, P.E. Evaluation of six methods for correcting bias in estimates from ensemble tree machine learning regression models. Environ. Model. Softw. 2021, 139, 105006. [Google Scholar] [CrossRef]
  56. Shao, H.; Li, H.; Jin, S.; Fan, R.; Wang, W.; Liu, B.; Ma, Y.; Wei, R.; Gong, W. Exploring the Conversion Model from Aerosol Extinction Coefficient to PM1, PM2.5 and PM10 Concentrations. Remote Sens. 2023, 15, 2742. [Google Scholar] [CrossRef]
  57. Wang, S.; Cuevas, C.A.; Frieß, U.; Saiz-Lopez, A. MAX-DOAS retrieval of aerosol extinction properties in Madrid, Spain. Atmos. Meas. Tech. 2016, 9, 5089–5101. [Google Scholar] [CrossRef]
  58. Lee, H.; Irie, H.; Gu, M.; Kim, J.; Hwang, J. Remote sensing of tropospheric aerosol using UV MAX-DOAS during hazy conditions in winter: Utilization of O4 absorption bands at wavelength intervals of 338–368 and 367–393 nm. Atmos. Environ. 2011, 45, 5760–5769. [Google Scholar] [CrossRef]
  59. Li, C.; Huang, Y.; Guo, H.; Wu, G.; Wang, Y.; Li, W.; Cui, L. The Concentrations and Removal Effects of PM10 and PM2.5 on a Wetland in Beijing. Sustainability 2019, 11, 1312. [Google Scholar] [CrossRef]
Figure 1. Geographic distribution of five MAX-DOAS instruments and three CNEMC stations. (a) Locations of MAX-DOAS instruments BJ-CAM and LZU and CNEMC stations 1006A and 1479A. (b) Locations of MAX-DOAS instruments AHU, HFC, and HF-HD and CNEMC station 1270A.
Figure 2. $SE_{Tcc}$, $SE_{AER}$, and their corresponding fitting curves.
Figure 3. Flowchart of data processing and model development.
Figure 4. Test set results: (a) LightGBM, (b) Random Forest, (c) XGBoost, and (d) BP.
Figure 5. Test set results: (a) Before cloud removal. (b) After cloud removal.
Figure 6. Spatial site cross-validation results: (a) AHU, (b) BJ-CAM, and (c) LZU.
Figure 7. (a) Feature importance contribution results of the model before feature vector optimization. Test results (b) before and (c) after feature vector optimization. Test results after removing (d) the SAA feature vector and (e) the high-angle O4 dSCD feature vector.
Figure 8. Monthly averaged PM2.5 concentration time series fitted at different monitoring directions of (a) HF-HD and (b) HFC.
Figure 9. PM2.5 concentration distribution at different directions of the HF-HD and HFC sites.
Figure 10. Proportions of different PM2.5 concentration values in various directions from HF-HD and HFC in each season: (a) HF-HD. (b) HFC.
Figure 11. (a) Variations in PM2.5 concentration in the east–west directions of HF-HD and the west direction of HFC on 23 December 2021. (b) Spatial distribution of PM2.5 concentrations along the east–west directions of HF-HD and the west direction of HFC at selected times on 23 December.
Table 1. Information on the MAX-DOAS sites.
| MAX-DOAS Site Name | Longitude and Latitude | Observation Time | Elevation Angle Sequence | Observation Azimuth Direction |
| --- | --- | --- | --- | --- |
| BJ-CAM | 116.32E, 39.95N | 15 April 2018–26 May 2023 | 1°, 2°, 3°, 4°, 5°, 6°, 8°, 10°, 15°, 30°, 90° | 130° |
| LZU | 103.85E, 36.04N | 20 October 2018–10 October 2022 | | 287° |
| AHU | 117.18E, 31.77N | 22 October 2020–17 April 2023 | | 107° |
| HFC | 117.26E, 31.76N | 19 January 2021–25 July 2022 | | 180°, 270° |
| HF-HD | 117.31E, 31.88N | 28 January 2021–14 May 2022 | | 0°, 90°, 180°, 270° |
Table 2. Parameter settings of each model.
| Model | Parameter | Value |
| --- | --- | --- |
| Random Forest | n_estimators | 100 |
| | max_depth | 15 |
| | min_samples_split | 2 |
| | min_samples_leaf | 1 |
| | max_features | sqrt |
| | n_jobs | 1 |
| XGBoost | n_estimators | 200 |
| | max_depth | 6 |
| | learning_rate | 0.05 |
| | subsample | 0.8 |
| | colsample_bytree | 0.8 |
| | gamma | 0.1 |
| LightGBM | n_estimators | 200 |
| | max_depth | 10 |
| | learning_rate | 0.05 |
| | subsample | 0.8 |
| | num_leaves | 31 |
| | colsample_bytree | 0.8 |
| | min_child_samples | 20 |
| BP | hidden_layer_sizes | (64, 32, 16) |
| | activation | relu |
| | solver | adam |
| | alpha | 1 × 10⁻⁴ |
| | learning_rate_init | 0.001 |
| | max_iter | 1000 |
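The settings in Table 2 can be written down as a plain configuration, which may help when reproducing the model comparison. The sketch below is illustrative only. The parameter names match the keyword arguments of scikit-learn's RandomForestRegressor and MLPRegressor, XGBoost's XGBRegressor, and LightGBM's LGBMRegressor, but the choice of these libraries and of regressor classes is an assumption, since the table does not name the implementations used. The names TABLE2_PARAMS and build_models are hypothetical.

```python
# Hyperparameters transcribed from Table 2. Nothing else is taken from the paper.
TABLE2_PARAMS = {
    "Random Forest": {
        "n_estimators": 100, "max_depth": 15, "min_samples_split": 2,
        "min_samples_leaf": 1, "max_features": "sqrt", "n_jobs": 1,
    },
    "XGBoost": {
        "n_estimators": 200, "max_depth": 6, "learning_rate": 0.05,
        "subsample": 0.8, "colsample_bytree": 0.8, "gamma": 0.1,
    },
    "LightGBM": {
        "n_estimators": 200, "max_depth": 10, "learning_rate": 0.05,
        "subsample": 0.8, "num_leaves": 31, "colsample_bytree": 0.8,
        "min_child_samples": 20,
    },
    "BP": {
        "hidden_layer_sizes": (64, 32, 16), "activation": "relu",
        "solver": "adam", "alpha": 1e-4, "learning_rate_init": 0.001,
        "max_iter": 1000,
    },
}


def build_models(params=TABLE2_PARAMS):
    """Instantiate one regressor per Table 2 entry.

    Assumption: the library and class for each model are a guess based on
    the parameter names; the paper's actual implementation may differ.
    """
    # Imported here so the configuration above is usable without these packages.
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.neural_network import MLPRegressor
    from xgboost import XGBRegressor
    from lightgbm import LGBMRegressor

    constructors = {
        "Random Forest": RandomForestRegressor,
        "XGBoost": XGBRegressor,
        "LightGBM": LGBMRegressor,
        "BP": MLPRegressor,  # back-propagation network as a multilayer perceptron
    }
    return {name: cls(**params[name]) for name, cls in constructors.items()}
```

Calling build_models() returns a dictionary of unfitted estimators keyed by the model names used in Table 2. Each can then be trained with its fit(X, y) method on whatever feature matrix and PM2.5 target the reader prepares.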
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Qin, H.; Li, Q.; Xia, S.; Zhang, Z.; Hu, Q.; Tan, W.; Guo, T. A Novel Framework Integrating Spectrum Analysis and AI for Near-Ground-Surface PM2.5 Concentration Estimation. Remote Sens. 2025, 17, 3780. https://doi.org/10.3390/rs17223780

