Emission-Based Machine Learning Approach for Large-Scale Estimates of Black Carbon in China

: Tremendous efforts have been made to construct large-scale estimates of aerosol components. However, Black Carbon (BC) estimates over large spatiotemporal scales are still limited. We proposed a novel approach utilizing machine-learning techniques to estimate BC on a large scale. We leveraged a comprehensive gridded BC emission database and auxiliary variables as inputs to train various machine learning (ML) models, specifically a Random Forest (RF) algorithm, to estimate high spatiotemporal BC concentration over China. Different ML algorithms have been applied to a large number of potential datasets and detailed variable importance and sensitivity analysis have also been carried out to explore the physical relevance of variables on the BC estimation model. RF algorithm showed the best performance compared with other ML models. Good predictive performance was observed for the training cases (R 2 = 0.78, RMSE = 1.37 µ gm − 3 ) and test case databases (R 2 = 0.77, RMSE = 1.35 µ gm − 3 ) on a daily time scale, illustrating a significant improvement compared to previous studies with remote sensing and chemical transport models. The seasonal variation of BC distributions was also evaluated, with the best performance observed in spring and summer (R 2 ≈ 0.7–0.76, RMSE ≈ 0.98–1.26 µ gm − 3 ), followed by autumn and winter (R 2 ≈ 0.7–0.72, RMSE ≈ 1.37–1.63 µ gm − 3 ). Variable importance and sensitivity analysis illustrated that the BC emission inventories and meteorology showed the highest importance in estimating BC concentration (R 2 = 0.73, RMSE = 1.88 µ gm − 3 ). At the same time, albedo data and some land cover type variables were also helpful in improving the model performance. We demonstrated that the emission-based ML model with an appropriate auxiliary database (e.g., satellite and reanalysis datasets) could effectively estimate the spatiotemporal BC concentrations at a large scale. In addition, the promising results obtained through this approach highlight its potential to be utilized for the assessment of other primary pollutants in the future.


Introduction
Black Carbon (BC) aerosols play a crucial role in atmospheric aerosols, resulting from the incomplete combustion of carbonaceous materials.Despite their relatively small proportion of atmospheric aerosols (up to 10%), BC particles have significant environmental, climatic, and health impacts.They can absorb solar radiation across all wavelength regions and contribute positively to global warming [1,2].Fresh BC particles undergo aging processes in the atmosphere, which could alter their properties and radiative forcing as they interact with various chemical substances, affecting their hydrophobicity and radiative characteristics [3,4].They can also become cloud condensation nuclei, which affects the formation of clouds and rainfall, making it easier for them to be scavenged by wet precipitation and thus shortening the residence time in the atmosphere [5].BC has been demonstrated to exert a higher direct radiative forcing than methane, making it one of the most essential warming agents after carbon dioxide [2].
Epidemiological studies have also linked BC exposure to adverse health effects and premature mortality, particularly concerning the cardiovascular system [6].There is a critical need to estimate the temporal and spatial variation of BC concentration in high resolution on a large scale.In recent decades, considerable efforts have been made to estimate PM 2.5 species concentrations by ground-based remote sensing that incorporated complex refractive index, particle size distribution, and single-scattering albedo of aerosols to invert BC concentrations and other aerosol scattering components [7][8][9][10].These studies have provided valuable in-situ measurements of aerosol concentrations and their optical properties.However, ground-based instruments are typically located at specific monitoring sites, providing localized measurements that limit our capabilities to capture the spatial variability of BC aerosols over larger areas.
Over larger spatial scales, satellite remote sensing technology can detect aerosols without contacting the target, and it is the primary tool for monitoring the spatial-temporal distribution of aerosols on a global scale.Multi-angle polarization observation is an effective way to quantify carbonaceous aerosols [11][12][13].The algorithm avoids multiple assumptions and achieves multi-parameter retrieval, including BC concentration.However, lower portability and limited temporal resolution restrict further historical analysis and assessment applications.There have been some efforts to inverse BC by using single-view satellites (i.e., MODIS) with limited observations [11,[14][15][16].Bao et al. (2020) proposed a new algorithm that achieves reasonable BC quantification by using aerosol optical depth (AOD) as a known input to address the issues of inadequate observations for single-view sensors.Gogoi et al. (2023) utilized observations from the Cloud and Aerosol Imager-2 (CAI-2) on board the Greenhouse gases Observing Satellite-2 (GOSAT-2) to retrieve BC concentrations using the multi-wavelength and multi-pixel method (MWPM) over India, and the validated algorithm was extended to elucidate global BC features [17].Good performance of these remote sensing algorithms could only be achieved under high aerosol loading conditions (e.g., AOD > 0.5), which may lead to discrepancies over low aerosol conditions.
Previous research shows that the quantity and impacts of aerosol components in a specific receptor region could be controlled by various factors, including their spatial dependence on primary sources, the global variation and dispersion of emissions, meteorological factors that influence source-receptor pathways, and the physic-chemical properties of the aerosol that affects the mixing state and interactions with other species [18].Studies using the global aerosol climate models [19][20][21][22] and chemical transport models [23][24][25][26][27] have also been evaluated to quantify long-term high-resolution aerosol speciation concentrations in various parts of the world.However, these modeling approaches often suffer from significant biases (R 2 ≈ 0.4-0.6 in the studies mentioned above) and uncertainties, particularly in capturing the spatial and temporal variability of aerosols with sufficient horizontal resolutions and accuracy, as research suggests that the performance of these models could greatly rely on the parameterization scheme uncertainties as well as the emission inventories [28].
Some research incorporated satellite-based AOD and aerosol compositions from the CTMs to estimate the PM 2.5 components [28,29].For instance, Geng et al. (2018) derived PM 2.5 chemical composition by coupling AOD with conversion factors obtained by the GEOS-Chem model [30].In some other studies, for instance, ground-level observations, meteorological data, and land cover type variables were used to reconstruct and downscale the chemical transport models.Di et al. (2016) incorporated a neural network to recalibrate GEOS-Chem estimated PM 2.5 components and improved their spatial resolution by integrating land use variables [31].In addition, Meng et al. (2018) applied machine-learning algorithms to estimate PM 2.5 , sulfate, nitrate, OC, and EC across the United States [32].Although these applications have improved the spatial resolution and agreement between the PM 2.5 speciation concentrations, temporal variabilities in PM 2.5 species cannot be suffi-ciently explained, and these methods usually provided low performance in temporal scales (e.g., R 2 ≈ 0.4-0.5 for different species).The reasons behind the poor performance of these applications might be due to the scarcity of aerosol (e.g., BC) ground-level concentration monitors that hinder the valid evaluation of the model performance with real-world conditions.In addition, the variations and distributions of the actual emission inventories could significantly influence the mixing state of the model and cause significant discrepancies between the modeled and measured BC concentrations [18].Meteorology was also explained as one of the important factors controlling the model performance [33][34][35].Although these applications could significantly improve our understanding of BC's large-scale spatial and temporal variability, obtaining higher spatiotemporal resolution BC estimation with sufficient accuracy is still challenging.Therefore, there is an urgent need to develop a new method for estimating BC on a large scale, especially for China, where rapid economic growth and strict emission control have been experienced in the past decade.
The application of machine learning (ML) techniques in the estimation of atmospheric pollution in China has developed rapidly in the past few years, with the accumulated "big data" in the past ten years from established observational networks and satellite remote sensing products [36,37].ML techniques are expected to perform better than statistical methods in dealing with the complex nonlinear relationships between various environmental parameters [36].Several researchers have reported significantly higher performance of ML approaches than conventional statistical approaches and CTMs for PM 2.5 estimation [36,38].However, there are limited studies to estimate BC concentration at a large scale by integrating machine learning methods.While satellite remote sensing and numerical simulation provide essential insights into aerosol distributions, it is crucial to address the limitations of these methods, including their dependence on high aerosol loading conditions and limited model uncertainties.Integrating ground-based measurements, satellite data, and advanced machine learning techniques offers a promising approach to overcome these limitations and improve the accuracy of BC estimations at high spatiotemporal resolutions.
In this study, we use the gridded BC emission inventory database and our accumulated ground-based BC monitoring networks over China combined with machine-learning algorithms and satellite remote sensing and reanalysis database to estimate large spatiotemporal scale BC concentrations over China.In this study, multiple machine learning algorithms have been applied based on the available large-scale database to retrieve and validate ground-level BC concentration.Detailed variable importance and sensitivity analysis have also been carried out to explore the physical relevance of variables on large-scale BC estimation models.

BC Ground-Based Measurements
China has established various air quality monitoring networks, including BC monitoring networks, operating continuously over various administrative centers and municipal districts.The network can provide adequate data on the concentration of ground-based BC.The data used in this study were 5 min resolution BC concentration data obtained from the China Meteorological Administration (CMA) from 42 sites across China for one year from 1 January 2015 to 31 December 2015 (Figure S7).BC monitoring in each of the 42 monitoring areas was carried out by AE31 aethalometers (Magee Scientific, Berkeley, CA, USA).The instrument measures light attenuation at 370, 470, 520,590, 660, 880, and 950 nm wavelengths [2,39].The light absorption of BC particles at 880 nm wavelength is much greater than that of aerosol particles, dust, and brown carbon (BrC).Therefore, the actual BC concentrations were calculated based on the 880 nm wavelength.The data quality of our BC ground-based monitoring database was evaluated before the modeling process.All the data points containing missing values were excluded from the analysis.The outliers beyond ± 3 standard deviations were removed.We also evaluated the representativeness of the data by examining the distribution and characteristics of the samples.

Satellite Products
Various satellite datasets have been analyzed and used in this study, including the aerosol parameters, land use data, and albedo data from MODIS sensors.MODIS sensors onboard Terra and Aqua satellites can cover the earth every 1 or 2 days.MODIS sensors comprise 36 spectral bands, which could be further processed to derive other products.Detailed information on MODIS data processing can be found elsewhere [40].The available albedo database obtained in this study contained data on forecast albedo (Fal) and nearinfrared albedo for defuse radiation (Alind).We also brought leaf area vegetation indices, including leaf area index high vegetation (lai_hv) and leaf area index low vegetation (lai_lv) in this study.
Land use data obtained from MODIS has been utilized as it could provide coarse information on the potential emission types, which has proven helpful in PM2.5 estimation based on ML [41] or other statistical methods [28,42,43].Land Cover Type, Ver 6, provided by MODIS, was used in this study.This database has a temporal resolution of 1 year and a spatial resolution of 500 m.We utilized the annual International Geosphere-Biosphere Programme (IGBP) classification, which divides the earth's surface into 17 categories (hereafter called type 1-17) [44].This research calculated and used the ratio containing each land use category.Detailed information on the IGBD classification of land cover types can be found in Sidhu et al. (2017) [45].Figure S1 in the Supplementary Materials illustrates the distributions of different land cover types obtained in this study.All the land cover types and descriptions are illustrated in Table S1.

Meteorological and Other Parameters
Various types of meteorological variables, including potential temperature (pt), relative humidity (RH), surface pressure (sp), evaporation (e), boundary layer height (blh) (>200), boundary layer dissipation (bld), forecast surface roughness (fsr), as well as the wind speed and wind direction data have been included in the model training.BC particles also absorb solar radiation, mainly in the visible and infrared wavelength regions.Therefore, we could expect that the data on long-and short-wavelength radiation and heat flux may also contain BC information.Thus, the radiation and heat flux parameters were also tested in addition to the meteorological parameters.The data variables presented here are hourly readings from the ERA-Interim atmospheric reanalysis products.The data is the most up-to-date global atmospheric reanalysis produced by the European Centre for Medium-Range Weather Forecasts (ECMWF), spanning from 1 January 1989 to the present day and continuously updated in near-real time [46].The spatial resolution of this data is 0.25 × 0.25 degrees.It is important to note that RH and Wind fields were calculated using the Equations outlined in (S1)-(S7) of the Supplementary Materials.
We also added latitude, longitude, and the day of the year to our analysis since these data contain information on spatial variability, environmental factors, and seasonality, which could allow the model to make more accurate predictions across different locations and periods as described in some previous publications for PM 2.5 estimation [38,47].Table S2 in the Supplementary Materials illustrates the details of all the final variables included in the analysis.

Gridded Emission Inventory
In this study, we innovatively introduce the application of a multi-resolution BC gridded emission inventory database, which includes emission information of BC from different sectors as the key variables to estimate BC concentrations across China in the machine learning method.Emission sources play a dominant role in determining the spatial distribution of primary pollutants since their lifetime in the atmosphere is short.Thus, they are less likely to experience long-range transport [48].BC emission source data obtained in this study were calculated using the space-for-time substitute algorithm developed by Peking University [49,50] to calculate monthly residential fuel consumption, which is conventionally designed for the chemical transport models (CTM).The database contained monthly resolution gridded emission inventories of BC from different sectors, including energy sector emissions, residential sector emissions, industrial emissions, wildfire emissions, and other sectors in China.We also included the gridded total sum of the BC emission inventories data in our analysis.The emission inventory database can offer extensive coverage of global emission inventories [51].This emission source database was utilized in this research mainly because the data can provide detailed biomass burning and wildfire emissions other than industrial and residential sources.BC emission source data used in this study have a total spatial resolution of 0.1 degrees and a temporal resolution of 1 month (Figure S2 and Table S2).

Data Integration and Database Creation
The study utilized a database with diverse spatial and temporal resolutions.For instance, the BC emission inventories had a monthly resolution of 0.25 • × 0.25 • , while the MODIS land cover type data had an annual resolution of 500 m.On the other hand, meteorological data had an hourly resolution of 0.25 • × 0.25 • , and AOD data had daily resolutions of 1 km.Consequently, several data transformations and resampling were necessary to create our final database, which estimates daily ground-based BC concentrations with 0.25 • × 0.25 • spatial resolutions.
To create a gridded database for estimating BC concentrations on a large scale in China, we resampled all predictor variables to ensure their uniform distribution.Our goal was to achieve a spatial resolution of 0.25 • × 0.25 • and a temporal resolution of 1 day, resulting in nationwide variable datasets.We utilized resampling algorithms to match and correspond these datasets to our meteorological data at the desired spatial resolution.For MODIS land cover type data, we performed zonal statistics algorithms to calculate the ratio of each land cover category associated with our desired spatial resolution.The MODIS land cover type data and BC emission inventories were unavailable at the daily timescale.Therefore, we transferred the annual/monthly data to daily-scale estimates by assuming that the land cover/emission and factors influencing them remain relatively constant.While this assumption generally holds true for land cover, emissions may not always be the case, as they can be subject to more dynamic variations.However, it is worth noting that this assumption is commonly adopted in conventional numerical model simulations of BC, as BC variations are more influenced by meteorological variations rather than emission change at a daily scale.
To build our training, cross-validation, and testing case database to train our machine learning model, we generated and corresponded the average value of all the predictor variables within the 5-km radius of the BC monitoring sites.We obtained the value related to the nearest site for other variables with coarser spatial resolution.The distance between BC monitoring sites and other variables was derived using the Haversine algorithm according to the latitude and the longitude of the two points.Detailed information and related equations regarding integrating our training database can be found in Supplementary Materials Section S2 and Equations (S8)-(S10).The list of all the predictive variables used to construct our machine-learning algorithms, including their classifications, detailed resolution, and data sources, can be found in Table S2 in the Supplementary Materials.

Machine Learning Algorithms
We utilized various machine learning algorithms, including Random Forest (RF), as well as different boosting models (i.e., Extreme Gradient Boosting (XGBoost), Stochastic Gradient Boosting (SGB), and Boosted Trees (BT)) to estimate BC ground-based concentrations based on the predictor variables obtained in Table S2 and described in the previous sections.

Boosting Algorithms
Boosting is a recognized method in machine learning, founded as an ensemble algorithm for classification and regression in supervised machine learning.Boosting is a practice that works on a family of 'weak learners' (i.e., a model that produces predictions that are slightly correlated with the response variable) to convert them into strong learners (i.e., a model that is highly associated with the response) [52].Gradient boosting machines are a series of machine-learning algorithms that have shown substantial achievement in many practical applications.They are highly customizable to the application's requirements, like learning about different loss functions [53].In gradient boosting machines or GBMs, the learning process sequentially fits new models to estimate the target variable accurately.The main idea behind this algorithm is to construct the new base learners to reach maximum correlation with the negative gradient of the loss function associated with the whole ensemble.Herein, a series of different gradient boosting algorithms, including a Stochastic Gradient Boosting Machine (SGB), Extreme Gradient Boosting Machines (XG-Boost), Boosted Trees (BT), and Random Forest (RF) algorithms have been implemented.The performance of each of these ML algorithms has been evaluated.

Random Forest (RF)
Random forest is an ML algorithm based on ensemble learning and composed of multiple decision trees [54].Each decision tree is considered a classifier or a regressor, with N decision trees having N results.For the regression, the random forest will integrate all the regression results given by each decision tree and take the average value as the final output.Random forest divides the training dataset into N samples by sampling with replacement, and a decision tree makes each sample.A node on each decision tree represents a feature test.The output of the decision tree is averaged as the final output.The advantage of random forest is that it cannot only process high-dimensional data but also has fast classification speed and high accuracy.The model can also provide a measure of the contribution of a variable called "importance," which represents the sum of the Gini coefficient index (GI) of all nodes split on the feature.The GI index is described in Equation ( 1): where n is the number of categories and is the sample weight of each category, the importance of each feature on each node is the change in GI index before and after the node branch [47].According to the variable contribution index provided by the random forest, it is possible to know the variables with more significant contributions to the estimation of the random forest.

Model Training, Cross-Validation, and Testing
The database obtained in this study contains data on all the predictor variable databases (i.e., meteorology, land used data, AOD, LAI, albedo, and gridded emission inventories) as described in detail in Sections 2.1 and 2.2.Detailed information on the predictor variables and their sources can be found in Supplementary Materials Table S2.Our database contained approximately 5000 data points that were analyzed based on the procedures described in the previous sections to create our training, cross-validation, and testing case database.80% of the entire database was selected as training and cross-validation (CV) cases, while 20% of the data were used for model testing performance.The testing-case database was new data not included in the model for training and cross-validation.We used the 10-fold validation method to evaluate the accuracy of estimating BC concentration.The verification method randomly selects 90% of the training samples for model fittings and 10% for verification.This operation was repeated ten times to guarantee all the samples were involved, and the average coefficient of determination obtained every time was used as the verification results.
We utilized grid search and random search strategies to optimize the performance of our machine-learning models.Grid search involves a systematic approach to evaluating different hyperparameter combinations on a pre-defined grid, while random search tries out randomly selected hyperparameters within a specified range.The grid search approach has been applied to tune RF hyperparameters.In contrast, a randomized search strategy has been carried out for hyperparameter tuning of other ML approaches used in this research.
To ensure the highest level of accuracy, we conducted a 10-fold cross-validation repeated ten times for the hyperparameter tuning process.The optimal combination of hyperparameters was determined by maximizing the model's performance by calculating R 2 and minimizing the root mean squared error (RMSE) values.By implementing these rigorous processes, we identified the most effective hyperparameters for our machinelearning models, leading to improved accuracy and better predictive power.Table S3 in the Supplementary Materials illustrates the tuned optimal hyperparameters of all the ML algorithms presented in this research.Training, cross-validation, testing database, and hyperparameter tuning processes were performed by the Caret package [55], implemented in R (Version 4.3).The schematic flowchart describing the process of constructing our ML algorithms has been illustrated in Figure S3.

Variable Importance Measures
Different ML methods provide their variable importance measures to draw specific features of the structure of the model.On the other hand, tree-based models, including RF out-of-the-bag data, could be used as the model's variable importance, as described in the previous section.
The drawback of the model-specific variable importance (as calculated by the RF model) is the lack of interpretability, as other ML algorithms do not provide relative and direct variable importance measures for interpretation.In this case, model-agnostic or other independent variable importance methods are essential as they allow a side-by-side comparison of the explanatory variable's importance among ML algorithms with different configurations.Herein, we calculated a model-agnostic permutation-based method that does not assume anything about the structure of our predictive models.The fundamental idea of this approach is to calculate the model's performance change after permuting the explanatory variables.This approach has been described elsewhere in more detail by Fisher et al. (2019) [56].The importance measure of the permutation-based variable can be summarized in Equation (2).The results will be explained in more detail in the following sections.
where L*j corresponds to the loss function that quantifies the goodness-of-fit of the ML algorithm, X*j is the matrix of our observed variables after the permutation of the jth column of the matrix X (i.e., by permuting the observed values of X*j).ŷ*j denotes the matrix of the model predictions based on the modified data of X*j.Herein, we applied 50 permutations to ensure sufficient randomness and assess the associated uncertainty of the calculated variable-importance values.To compare the most effective common features from the permutation-based models, we first evaluated the impact of each feature drawn from the ML models by calculating the percentage change of RMSE from the baseline for each of the features and subsequently compared the top common features of each of the ML models among the features with >10% change in the model RMSE.In this study, we also calculated the importance of the individual RF-specific variable, and the results are provided in Figure S4 in the Supplementary Materials.

Model Performance
Training and cross-validation databases were used to develop SGB, BT, XGBosst, and RF algorithms to train ground-level BC concentrations following the repeated 10-fold CV procedure described in the previous section.The results of the variations of the performance measures of our applied models for BC estimations are shown in Figure 1.The SGB model observed relatively lower performance than those of the other ML algorithms (i.e., RF, BT, and XGBoost).R 2 of the SGB model illustrated an average value of 0.69 (RMSE = 1.98).Other ML algorithms showed higher performance for the estimation of BC.For instance, XGBoost explained an R 2 = 0.74 (RMSE = 1.79).On the other hand, the RF model illustrated the highest performance (R 2 = 0.78, RMSE = 1.69) for BC estimation (Figure 1).agreement and a low bias between the estimated and measured BC (R 2 = 0.78, RMSE = 1.37 μgm −3 , Slope = 0.72).Further analysis of our RF model also illustrated an excellent performance for the test-case database (R 2 = 0.77, RMSE = 1.37 μgm −3 , Slope = 0.63).In addition, we also compared the performance of our RF model for BC estimations with our previous physical BC retrieval method, and our RF algorithm outperformed the currently available BC retrieval approach purely based on satellite remote sensing (R 2 = 0.545) described in detail in Bao et al. (2020) [15].Figure 2c shows the averaged time series of estimated versus measured BC concentrations across all the stations in China.In general, the time series of the estimated BC across China agreed reasonably well with the measured BC concentrations.However, we observed a slight model underestimation, mainly occurring during wintertime haze and the heavy pollution episodes over China.In addition, we also compared the performance of our RF model for BC estimations with our previous physical BC retrieval method, and our RF algorithm outperformed the currently available BC retrieval approach purely based on satellite remote sensing (R 2 = 0.545) described in detail in Bao et al. (2020) [15].Figure 2c shows the averaged time series of estimated versus measured BC concentrations across all the stations in China.In general, the time series of the estimated BC across China agreed reasonably well with the measured BC concentrations.However, we observed a slight model underestimation, mainly occurring during wintertime haze and the heavy pollution episodes over China.

Seasonal Model Validation
Figure 3 illustrates the scatter plots of the model performance for different seasons provided by daily averaged data.The R 2 of the four seasons ranges from 0.68 to 0.79.We observed the highest model performance of BC during spring (R 2 = 0.76, RMSE = 1.26, slope = 0.73) and summer (R 2 = 0.68, RMSE = 0.98, slope = 0.76).A slightly diminished model performance was observed for BC performance during autumn (R 2 = 0.79, RMSE = 1.37, slope = 0.67) and winter (R 2 = 0.72, RMSE = 1.63, slope = 0.66), which may be due to the scattered BC emission sources and frequent high pollution events occurrence resulting in more considerable variations across the space and time and therefore decreases the agreement of the model.The performance of the RF model during the seasonal scale again showed better performance than our previous remote sensing BC model described in Bao et al. (2020) [15].

BC Spatial Distribution
The constructed RF model was applied to estimate the daily BC near-surface concentration across mainland China.Daily BC estimates were then consolidated into monthly, seasonal, and annual averages to map the spatial distributions of BC concentrations.Figure 4 illustrates the spatial distribution map of the large-scale annual average BC concentration across China for 2015 with a spatial resolution of 0.25 • × 0.25 • .BC concentration covered most regions nationwide except some small areas represented in white.Circles with black outlines show the annual average BC concentration observations.We generally observed a good agreement with a negligible bias between the stations; however, several stations experienced noticeable underestimation.The spatial distribution pattern of BC concentration estimated by the RF model roughly complied with the BC emission inventory data to some extent but with a notable difference (Figure S2).It can be observed that some regions (e.g., Pearl River Delta (PRD) and Yangtze River Delta (YRD) regions), which were shown to be hot spot locations in gridded BC emission inventories (Figure S2), do not correspond entirely to the spatial BC distributions.This has also been observed in the scatter plots of model performance (Figure 2), where some data points did not directly correspond with the estimated BC and the gridded total BC emission inventories.This condition was usually more noticeable during spring and summer (Figure 3), which could be affected by other factors, such as local meteorological conditions (e.g., Southerlies monsoon clean air effect for the emissions in the delta region in summer).This phenomenon implies that the RF works well with the nonlinear relationship between emission information and pollution concentration.

BC Spatial Distribution
The constructed RF model was applied to estimate the daily BC near-surface concentration across mainland China.Daily BC estimates were then consolidated into monthly, seasonal, and annual averages to map the spatial distributions of BC concentrations.Figure 4 illustrates the spatial distribution map of the large-scale annual average BC concentration across China for 2015 with a spatial resolution of 0.25° × 0.25°.BC concentration covered most regions nationwide except some small areas represented in white.Circles with black outlines show the annual average BC concentration observations.We generally observed a good agreement with a negligible bias between the stations; however, several stations experienced noticeable underestimation.The spatial distribution pattern of BC concentration estimated by the RF model roughly complied with the BC emission inventory data to some extent but with a notable difference (Figure S2).It can be observed that some regions (e.g., Pearl River Delta (PRD) and Yangtze River Delta (YRD) regions), which were shown to be hot spot locations in gridded BC emission inventories (Figure S2), do not correspond entirely to the spatial BC distributions.This has also been observed in the scatter plots of model performance (Figure 2), where some data points did not directly correspond with the estimated BC and the gridded total BC emission inventories.This condition was usually more noticeable during spring and summer (Figure 3), which could be affected by other factors, such as local meteorological conditions (e.g., Southerlies monsoon clean air effect for the emissions in the delta region in summer).This phenomenon implies that the RF works well with the nonlinear relationship between emission information and pollution concentration.Higher annual mean BC concentrations are mainly distributed over Beijing-Tianjin-Hebei, Shanxi Province, Northeast China, and Central China, which contain many coal mining and combustion activities [57].At the same time, higher BC values in Yunnan Province and Sichuan Basin would be related to biomass-burning activities [58,59].The average estimated BC concentration based on the RF model was 2.71 ± 1.18 μgm −3 .

Spatial Distribution of Seasonal BC Concentration
Figure 5 shows the seasonal average spatial distribution of BC over mainland China.Significant seasonal variation was observed for BC concentrations.The average BC concentration in spring, summer, autumn, and winter was 2.66 ± 1.21 μgm −3 , 2.54 ± 1.07 μgm −3 , Higher annual mean BC concentrations are mainly distributed over Beijing-Tianjin-Hebei, Shanxi Province, Northeast China, and Central China, which contain many coal mining and combustion activities [57].At the same time, higher BC values in Yunnan Province and Sichuan Basin would be related to biomass-burning activities [58,59].The average estimated BC concentration based on the RF model was 2.71 ± 1.18 µgm −3 .

Spatial Distribution of Seasonal BC Concentration
Figure 5 shows the seasonal average spatial distribution of BC over mainland China.Significant seasonal variation was observed for BC concentrations.The average BC concentration in spring, summer, autumn, and winter was 2.66 ± 1.21 µgm −3 , 2.54 ± 1.07 µgm −3 , 2.76 ± 1.19 µgm −3 , and 3.05 ± 1.42 µgm −3 respectively.We observed the highest BC concentration in winter, which is possibly due to the heating demands from the combustion of fossil fuels over northern China and also the stability of meteorological conditions during the winter (low wind speed and temperature inversion) that prevents the pollution from dispersion and causes the pollutions to be trapped in the area.We observed the lowest BC concentration in summer, mainly associated with the lower fossil fuel combustion and the prevailing southerly wind, which brings clean air from the ocean to inland in summer.We also observed that the BC concentration in North China Plain (NCP) is higher in autumn than in spring, which is probably related to straw combustion in North China, which generally occurs during September and October after the autumn harvest [60,61].Extensive missing data observed over northeastern China during the winter season would be due to the missing values for the satellite AOD data (Figure 5d).Comparison of the BC distribution derived from the RF model and the satellite inversions in our previous work [16] (Figure S5) demonstrates that the RF model showed significantly more detailed spatial and seasonal BC concentration variations.The larger background BC values observed in our RF model compared to our previous satellite-based method would be because the satellite-based method is generally valid only for highly polluted conditions (AOD > 0.5).However, the other restriction in our RF approach is that the new model tends to show lower BC concentration over northeastern China than that observed in our satellite-based method although the area has a higher frequency of fire events.This discrepancy would be due to the limited number of ground-based observations over northern China and therefore lower training rate of our RF model specific to that region.

Variable Importance
Figure 6 and Table S4 illustrate the top 20 most influential variables calculated from the model-agnostic permutation-based method, as discussed in Section 2.5 for all the models studied in this research for BC estimations.The shifted y-axis represents the baseline calculated for each model.The SGB model demonstrated the highest RMSE value, resulting in a more significant shift in the y-axis than the other ML approaches.RF and XGBoost models, on the other hand, showed the lowest baseline among different models for both BC estimations, suggesting that the relatively higher median absolute residuals observed in our other ML approaches (SGB and BT) are probably impacting the baseline (Figure S6).The distribution of the residuals associated with the probability of attrition to the measured BC was calculated in Figure S6.The median residuals are significantly lower for the RF than the other ML algorithms evaluated in this research.The RF model performed relatively better in handling the residuals than the XGBoost model.Boxplots illustrated that the RF model also illustrated relatively lower median absolute residuals for BC (Figure S6).
Furthermore, closely evaluating the most influential variables drawn from the models illustrated that although all of our models are picking up a unique construction in terms of the essential variables, there are common features that were consistently found influential in all of the models for BC estimations demonstrating further evidence that these variables have robust predictive signals (Figure 6).In general, we observed that BC emission inventories and meteorological data were the major contributors (more than 50% of total predictive variables) to the whole important variable list of all the ML predictive models further illustrating their impacts on the BC predictive signals.We can expect that these variables would substantially influence the ML estimation performance (Figure 6) (This will be explained in more detail in the following section).
Remote Sens. 2024, 16, x FOR PEER REVIEW 12 of 19 lower BC concentration over northeastern China than that observed in our satellite-based method although the area has a higher frequency of fire events.This discrepancy would be due to the limited number of ground-based observations over northern China and therefore lower training rate of our RF model specific to that region.

Variable Importance
Figure 6 and Table S4 illustrate the top 20 most influential variables calculated from the model-agnostic permutation-based method, as discussed in Section 2.5 for all the models studied in this research for BC estimations.The shifted y-axis represents the baseline calculated for each model.The SGB model demonstrated the highest RMSE value, resulting in a more significant shift in the y-axis than the other ML approaches.RF and XGBoost models, on the other hand, showed the lowest baseline among different models for both BC estimations, suggesting that the relatively higher median absolute residuals observed in our other ML approaches (SGB and BT) are probably impacting the baseline (Figure S6).The distribution of the residuals associated with the probability of attrition to the measured BC was calculated in Figure S6.The median residuals are significantly lower for the RF than the other ML algorithms evaluated in this research.The RF model performed relatively better in handling the residuals than the XGBoost model.Boxplots illustrated that the RF model also illustrated relatively lower median absolute residuals for BC (Figure S6).
Furthermore, closely evaluating the most influential variables drawn from the models illustrated that although all of our models are picking up a unique construction in terms of the essential variables, there are common features that were consistently found influential in all of the models for BC estimations demonstrating further evidence that these variables have robust predictive signals (Figure 6).In general, we observed that BC

Sensitivities Analysis
As mentioned in the previous sections, ML algorithms usually cannot provide physical relevance to the variables' contributions to construct the model.To further understand the sensitivity of the input variables and also understand the impacts of the contributions of our input features, we classified the input variables into six different categories: meteorological parameters (C1), Albedo (C2), land cover type (C3), Leaf Area Index (LAI) parameters (C4) and the grided BC emission database (C5).The detailed grouping information can be found in Table S2.The RF model was then retrained multiple times by applying these various categories and combinations of the variables to evaluate the relative importance of the input categories.Figure 7a-c illustrates the results of the RF model performance by applying each of the categories mentioned above as well as the combined variables.Among all the BC emission data, Wildfire emissions and the total gridded BC emissions were the most influential variables consistently found compelling in all the top common variable lists of all the predictive models.Wildfire emissions are illustrated to appear in the top 5 variables of all the top-performing predictive models, which could be because biomass burning is one of the essential emission sources for BC aerosols.Total gridded BC emission data were also observed to be influential in the top 5 common variable lists of all the predictive models, again illustrating their robust predictive signals.For the meteorological data, forecast surface roughness (fsr), surface pressure (sp) temperature (pt), wind speed and direction data, boundary layer height (blh), and boundary layer dissipation (bld) were observed to be the most effective in the top common meteorological variable list of all the ML predictive models.Other variable categories (other than meteorological and BC emission data) contributed less than 50% to the predictive variable lists of the ML models.Among leaf area index (LAI) variables, we observed that leaf area index low vegetation (lai-lv) was also consistently observed in the top 5 common features drawn from all the ML algorithms, possibly because the LAI also indicates the biogenic emission distribution information.Among other variables, AOD, albedo data (alind and aluvd), land cover type data (i.e., type 13, urban and construction districts), and day-of-year data also contributed to the predictive signals and appeared in the top 20 variable lists of all the ML models.
We also calculated the importance of the model-specific variable for our top-performing RF model and the correlation coefficient with surface BC concentrations, and the results have been provided in Figure S4 in the Supplementary Materials.In some cases, we observed that the variables with high importance correspond to a high correlation coefficient, and those with lower importance also correspond to smaller correlation coefficients.RFspecific variable importance analysis also illustrated that total gridded BC emissions, as well as the wildfire emissions data, showed the highest variable importance contributions (100% and 85%, respectively) in estimating the surface BC concentrations with high correlation coefficients between these variables (r > 0.5) (Figure S4).This was consistent with the model-agnostic permutation-based variable importance results observed in Figure 6.fsr and lai-lv were also observed in the top variable list of our RF-specific variable importance model.In general, the RF-specific variables agreed reasonably well with the permutationbased variable importance followed in Figure 6.Furthermore, it should be noted that although ML algorithms do not provide physical relevance to the contributions of the variables used to construct the model, the agreement of the variable importance measures derived from the two methods as well as the comparable common features derived from the independent permutation-based approach for all of the ML algorithms on the physical meanings in the BC estimations worth further investigation in the future research.

Sensitivities Analysis
As mentioned in the previous sections, ML algorithms usually cannot provide physical relevance to the variables' contributions to construct the model.To further understand the sensitivity of the input variables and also understand the impacts of the contributions of our input features, we classified the input variables into six different categories: meteorological parameters (C1), Albedo (C2), land cover type (C3), Leaf Area Index (LAI) parameters (C4) and the grided BC emission database (C5).The detailed grouping information can be found in Table S2.The RF model was then retrained multiple times by applying these various categories and combinations of the variables to evaluate the relative importance of the input categories.Figure 7a-c illustrates the results of the RF model performance by applying each of the categories mentioned above as well as the combined variables.
The RF model showed the lowest performance with only the land cover type (C3) variable group that indicates only a rough distribution of emission sources (R 2 = 0.41, RMSE = 2.72), which played essential roles in the previous research for PM 2.5 estimation [62].The RF model with only applying LAI data (C4) reached R 2 = 0.53 (RMSE = 2.47), which is probably because the LAI better indicates the biogenic and urban source characteristics with daily variations.The RF performance with only albedo variables (C2) reached R 2 = 0.56 (RMSE = 2.35), which is probably because absorption aerosols can influence the albedo variables (especially the parameters about 0.3-0.7 um).RF model reached R 2 = 0.57 (RMSE = 2.23) by only applying meteorological variables (C1).We also observed substantial improvements in the RF model performance (R 2 = 0.64, RMSE = 2.16) by only using the gridded BC emissions category (C5) (Hereafter called the base model), which is because it includes both the source distribution and the magnitude in high spatial and temporal resolution [18,28,30].This agreed reasonably well with the results observed in variable importance analysis (Figure 6 and Figure S4), where we observed a significant influence of BC predictive signals toward BC emission inventories and meteorological data.This again illustrates that the novel application of gridded BC emissions or the relevant emission information in our current analysis could substantially improve BC model prediction compared with our previous model, as discussed in the earlier sections.These variables provided additional information and helped provide physical relevance to the ML model to explain the variabilities of BC ground-level concentrations sufficiently.

Conclusions
This study explored the feasibility of constructing an emission-based ML model to derive large-scale spatiotemporal estimates of BC concentrations across China.Various ML algorithms have been tainted to construct large-scale BC estimations, and we conducted a series of variable importance and sensitivity analyses on different types of input variables that have physical relevance to BC concentration.Taking advantage of including BC emission inventories and the BC ground-level monitoring database, our constructed BC model has achieved high performance on the large-scale estimation of BC (for the test case, R 2 = 0.77, RMSE = 1.37 μgm −3 ).The model's variable importance analysis and variable sensitivity analysis illustrated that the novel application of the gridded BC emissions played a determined role in the estimation (R 2 = 0.64, RMSE = 2.18 μgm −3 ).At the same time, meteorological data, albedo data, LAI data, and some land cover data were also helpful to improve the model's performance.The model performed well in constructing large-scale estimations of daily-averaged BC concentration over China for 2015, which contains a large area, complex terrain, and diverse climate.At the same time, the model accurately estimated the seasonal-averaged BC concentrations even better.
In summary, the ML model based on gridded BC emissions inventories could substantially improve the accuracy of daily BC concentration estimations and works well both for low and high pollution levels.The constructed model could effectively compensate for the disadvantages of ground-based observations and satellite remote sensing in the spatial distribution, reducing the observational costs and optimizing BC monitoring on a large scale.Considering the positive results obtained from this study, it indicates the potential for estimating long-term BC database or other aerosol species based on the emission-based ML model.Such datasets would provide valuable information for long-term trend analysis, policy-making, and further research on the impacts of BC on air quality and climate.

Supplementary Materials:
The following supporting information can be downloaded at: In addition, the results from Figures 6 and 7 suggest that the performance of the ML models could be substantially improved by adding more physically relevant datasets (Figure 7b).In particular, we can expect substantial model performance improvements by adding all other variable groups to our base model (C5 model).Our analysis illustrated that adding C1 category data (Meteorology) to our base C5 model (C5 + C1) significantly improved the predictive performance of our RF model (R 2 = 0.73, RMSE = 1.84).This agrees perfectly with the results observed in the variable importance analysis (Figures 6 and S4), where meteorology and BC emissions variables contributed to the significant fractions of the top variable lists of all the predictive models.Comparisons of our ML model results with some of the case studies of the chemical transport model (CTM) simulations, including WFR-Chem analysis of ground-level BC and other aerosol speciation in China and East Asia, illustrated that ML algorithms provided comparable performance with the application of meteorology and BC emission inventories database (R 2 ML ≈ 0.7 > R 2 CTM ≈ 0.37-0.65).Our ML model even managed to surpass the performance of the CTM simulations of BC by adding more physically relevant variables.Previous applications of numerical model simulations of BC estimation construction provided an R 2 ≈ 0.37-0.65 [23][24][25][26][27].The lower performance of the CTM models compared with our ML algorithms might be due to the limited number of BC ground-level databases in many CTM studies to be used for evaluation and validation purposes.Furthermore, CTM approaches showed significant sensitivities to variations and distributions of the emission inventories and the meteorological data that could significantly influence the mixing state of the model and cause large discrepancies between the modeled and measured BC concentrations [18].
Adding other variable groups (except C1) to our base C5 model (C5 + C2, C5 + C3, and C5 + C4) did not substantially improve the overall RF model performance (R 2 ≈ 0.66).This again confirms the significant influence of the meteorological data in adding physical relevance and generalizing the BC model (Figure 7b).Our analysis illustrated that adding the C1 category to any other categories will substantially improve our model performance (R 2 ≈ 0.7).
Figure 7c also illustrated the model performance of other scenarios of adding more variable groups into the model training.We demonstrated that adding additional variables to our C5 + C1 model could further increase the predictive performance of our RF model (R 2 ≈ 0.75) by decreasing the model's error estimates.In particular, adding the albedo data helps improve the model performance R 2 = 0.75 (RMSE = 1.77) (C1 + C2 + C5).Adding the Albedo, land cover, and LAI datasets further helps increase the R 2 = 0.76 (RMSE = 1.74) (C1 + C2 + C3 + C4 + C5).In general, it can be seen that although the response of the RF model to additional input variables was not linear, the R 2 and RMSE estimates of the RF model showed gradual improvement.In particular, adding all the variable categories to the final model resulted in lower RMSE values, which demonstrates the gradual maturity of the ML algorithm by adding more information of physical relevance to the predictive model construction.This again illustrated the applicability and versatility of the ML algorithms trained by physically relevant variables to capture and reproduce BC concentration variabilities and their trends in both temporal and spatial scales, which showed good agreements compared with the numerical simulation and CTM analysis with lower computational costs.

Conclusions
This study explored the feasibility of constructing an emission-based ML model to derive large-scale spatiotemporal estimates of BC concentrations across China.Various ML algorithms have been tainted to construct large-scale BC estimations, and we conducted a series of variable importance and sensitivity analyses on different types of input variables that have physical relevance to BC concentration.Taking advantage of including BC emission inventories and the BC ground-level monitoring database, our constructed BC model has achieved high performance on the large-scale estimation of BC (for the test case, R 2 = 0.77, RMSE = 1.37 µgm −3 ).The model's variable importance analysis and variable sensitivity analysis illustrated that the novel application of the gridded BC emissions played a determined role in the estimation (R 2 = 0.64, RMSE = 2.18 µgm −3 ).At the same time, meteorological data, albedo data, LAI data, and some land cover data were also helpful to improve the model's performance.The model performed well in constructing large-scale estimations of daily-averaged BC concentration over China for 2015, which contains a large area, complex terrain, and diverse climate.At the same time, the model accurately estimated the seasonal-averaged BC concentrations even better.
In summary, the ML model based on gridded BC emissions inventories could substantially improve the accuracy of daily BC concentration estimations and works well both for low and high pollution levels.The constructed model could effectively compensate for the disadvantages of ground-based observations and satellite remote sensing in the spatial distribution, reducing the observational costs and optimizing BC monitoring on a large scale.Considering the positive results obtained from this study, it indicates the potential for estimating long-term BC database or other aerosol species based on the emission-based ML model.Such datasets would provide valuable information for long-term trend analysis, policy-making, and further research on the impacts of BC on air quality and climate.

Supplementary Materials:
The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/rs16050837/s1, Figure S1 S1: Detailed information, including the name and the classification of the land cover types obtained in Figure S2 based on the MODIS satellite database; Table S2: Final selected input variables included in the Random Forest Model, including their classifications and data information; Table S3: Hyperparameter tuning of the models employed in this research; Table S4: Permutation-based variable importance for all the input features included in our ML algorithms.The numbers represent RMSE loss after permutations, as illustrated in Figure 6.

Figure 1 .
Figure 1.Variations of the performance measures (R 2 , RMSE, and MAE) of our different predictive models.The dots display the mean, and the lines represent the 95% confidence interval bands for our training and cross-validation dataset.

Figure 1 .
Figure 1.Variations of the performance measures (R2, RMSE, and MAE) of our different predictive models.The dots display the mean, and the lines represent the 95% confidence interval bands for our training and cross-validation dataset.

Figure
Figure 2a,b illustrates the scatter plots of RF model performances in training and 10-fold cross-validation (10-CV) and test case database, respectively, as a function of the total gridded BC emission database.The training and cross-validation database showed a good agreement and a low bias between the estimated and measured BC (R 2 = 0.78, RMSE = 1.37 µgm −3 , Slope = 0.72).Further analysis of our RF model also illustrated an excellent performance for the test-case database (R 2 = 0.77, RMSE = 1.37 µgm −3 , Slope = 0.63).In addition, we also compared the performance of our RF model for BC estimations with our previous physical BC retrieval method, and our RF algorithm outperformed the currently available BC retrieval approach purely based on satellite remote sensing (R 2 = 0.545) described in detail in Bao et al. (2020)[15].Figure2cshows the averaged time series of estimated versus measured BC concentrations across all the stations in China.In general, the time series of the estimated BC across China agreed reasonably well with the measured BC concentrations.However, we observed a slight model underestimation, mainly occurring during wintertime haze and the heavy pollution episodes over China.

Figure 2 .
Figure 2. Scatter plots of measured versus estimated BC concentrations using the RF model as a function of the BC emission inventories.(a) Represents the training and 10-fold crossvalidation, and (b) illustrates the performance on the test-case database.The dashed line illustrates the linear regression fit.Panel (c) illustrates the time series of the average BC concentrations across different locations.

Figure 3
Figure3illustrates the scatter plots of the model performance for different seasons provided by daily averaged data.The R 2 of the four seasons ranges from 0.68 to 0.79.We observed the highest model performance of BC during spring (R 2 = 0.76, RMSE = 1.26, slope = 0.73) and summer (R 2 = 0.68, RMSE = 0.98, slope = 0.76).A slightly diminished model performance was observed for BC performance during autumn (R 2 = 0.79, RMSE = 1.37, slope = 0.67) and winter (R 2 = 0.72, RMSE = 1.63, slope = 0.66), which may be due to the scattered BC emission sources and frequent high pollution events occurrence resulting in more considerable variations across the space and time and therefore decreases the agreement of the model.The performance of the RF model during the seasonal scale again showed better performance than our previous remote sensing BC model described inBao et al. (2020) [15].

Figure 2 .
Figure 2. Scatter plots of measured versus estimated BC concentrations using the RF model as a function of the BC emission inventories.(a) Represents the training and 10-fold cross-validation, and (b) illustrates the performance on the test-case database.The dashed line illustrates the linear regression fit.Panel (c) illustrates the time series of the average BC concentrations across different locations.

19 Figure 3 .
Figure 3. Scatter plots of seasonal averaged measured versus estimated BC concentrations from the RF model as a function of the BC emission inventories (units: ug/m 3 , color represents the BC emission inventories).

Figure 3 .
Figure 3. Scatter plots of seasonal averaged measured versus estimated BC concentrations from the RF model as a function of the BC emission inventories (units: ug/m3, color represents the BC emission inventories).

Figure 4 .
Figure 4. Spatial distribution of the annual-average gridded BC concentration over China.

Figure 4 .
Figure 4. Spatial distribution of the annual-average gridded BC concentration over China.

Figure 6 .
Figure 6.Model independent permutation-based variable importance for the top 20 most essential variables derived from each ML algorithm.

Figure 6 .
Figure 6.Model independent permutation-based variable importance for the top 20 most essential variables derived from each ML algorithm.

19 Figure 7 .
Figure 7. Performance of the Random Forest model in a different scenario.(a) represents the scenarios only including essential variables in each category, (b) corresponds to scenarios including the combination of the two primary variable groups, and (c) corresponds to scenarios including the combination of more than two basic variable categories.

Figure 7 .
Figure 7. Performance of the Random Forest model in a different scenario.(a) represents the scenarios only including essential variables in each category, (b) corresponds to scenarios including the combination of the two primary variable groups, and (c) corresponds to scenarios including the combination of more than two basic variable categories.
: IGBD classification of the land cover type based on MODIS satellite database; Figure S2: Spatial distribution of the annual averages BC emission inventories developed by Peking University over China; Figure S3: Flow chart of the construction of the machine learning (ML) algorithm construction predictive model; Figure S4: Variable importance measures derived from the RF algorithms.The importance values were scaled to 100 for better illustration; Figure S5: Spatial distribution patterns of seasonal-averages BC concentrations estimated by our model across China (retrieved from Bao et al., 2020); Figure S6: Distributions of the residuals of BC predictive models in different quantiles.Boxplots represent the scatter residual distributions of different predictive models; Figure S7: Schematic map of the spatial distribution of BC monitoring sites over China; Table