Machine Learning to Predict the Global Distribution of Aerosol Mixing State Metrics

Atmospheric aerosols are evolving mixtures of chemical species. In global climate models (GCMs), this “aerosol mixing state” is represented in a highly simplified manner. This can introduce errors in the estimates of climate-relevant aerosol properties, such as the concentration of cloud condensation nuclei. The goal for this study is to determine a global spatial distribution of aerosol mixing state with respect to hygroscopicity, as quantified by the mixing state metric χ. In this way, areas can be identified where the external or internal mixture assumption is more appropriate. We used the output of a large ensemble of particle-resolved box model simulations in conjunction with machine learning techniques to train a model of the mixing state metric χ. This lower-order model for χ uses as inputs only variables known to GCMs, enabling us to create a global map of χ based on GCM data. We found that χ varied between 20% and nearly 100%, and we quantified how this depended on particle diameter, location, and time of the year. This framework demonstrates how machine learning can be applied to bridge the gap between detailed process modeling and a large-scale climate model.


Introduction
Field measurements show that individual aerosol particles are a complex mixture of a wide variety of species, such as soluble inorganic salts and acids, insoluble crustal materials, trace metals, and carbonaceous materials [1,2]. To characterize this mixture, the term "aerosol mixing state" is frequently used. In general, this comprises both the distribution of chemical compounds across the aerosol population ("population mixing state") and the distribution of chemical compounds within and on the surface of each particle ("morphological mixing state").
Both the population mixing state and the morphological mixing state are of importance for aerosol impacts, including chemical reactivity, cloud condensation nuclei (CCN) activity, and aerosol optical properties [3]. However, the morphological mixing state is beyond the scope of this study. We focus here exclusively on the population mixing state, and refer to it for brevity as "mixing state". In this context, the terms "internal" and "external" mixture are frequently used. An external mixture consists of particles that each contain only one species, which may be different for different particles. In contrast, an internal mixture describes a particle population where different species are present within one particle. If all particles consist of the same species mixture, and the relative abundances are identical, the term "fully internal mixture" is commonly used. Considering that aerosol populations contain particles of many different sizes, we can define these terms for the entire population (comprising all particle sizes) or for individual size ranges. An aerosol population might be approximately internally mixed for a certain size range, but the internal mixture assumption might not be fulfilled if a large size range is considered. While mixing state can impact both CCN properties and optical properties, here we target CCN properties, and interpret aerosol "species" in terms of hygroscopicity.
An example of an external mixture is shown in Figure 1a, which represents a particle population consisting of six particles, with the blue and the red color symbolizing two different aerosol species with different hygroscopicities. A fully internal mixture is shown in Figure 1d. In reality, aerosol populations assume mixing states that are neither fully externally nor internally mixed, as depicted by Figure 1b,c. Note that each of the four populations in Figure 1 contains the same total amounts of the two species, but the species distribution amongst the particles differs.

[Figure 1. Panel labels range from "Externally mixed" (a) to "Internally mixed" (d).]
Aerosol mixing state is challenging to represent in atmospheric aerosol models. The most rigorous approach is the particle-resolved approach by Riemer et al. [4], which explicitly resolves population mixing state. However, this method is too computationally demanding for routine use in spatially-resolved regional or global chemical transport models. Instead of resolving the full aerosol mixing state, regional and global models therefore use distribution-based methods, commonly known as modal and sectional models [5][6][7]. An inherent assumption of these methods is that within one mode or within one size section, the aerosol particles are internally mixed. This assumption can lead to misprediction of climate-relevant aerosol properties such as CCN concentrations and optical properties [8][9][10][11][12].
To illustrate this concept, Figure 2 shows the global distribution of the fraction of hygroscopic species (sulfate, ammonium, sea-salt, and aged organics) as simulated by GEOS-Chem-TOMAS for the month of January 2010 for particles of ∼358 nm. For areas where this fraction is close to 100% (oceans) or close to 0% (parts of the Saharan desert), the aerosol consists essentially of only hygroscopic or only non-hygroscopic species, respectively, so mixing state is not an issue in these areas. However, there are many regions such as the continental US or Europe where the fraction is between the two extremes. For these regions, the question is, given the local conditions, what degree of internal/external mixing is most likely? Our approach seeks to answer this question for different particle sizes, different geographic locations, and different seasons.
To quantify the degree of internal/external mixture, Riemer and West [13] introduced the mixing state index χ. This is a scalar quantity that varies between 0% for completely external mixtures and 100% for completely internal mixtures, as indicated by Figure 1. It can be calculated from per-particle species mass fractions (see Section 2.1), which requires either simulations with computationally expensive, high-detail aerosol models [13,14] or observations with a sophisticated suite of instruments [15,16]. Ching et al. [14] quantified the relationship between the mixing state index χ and the error in CCN concentrations that arises when mixing state information is neglected by assuming a fully internal mixture. The study shows that for more externally mixed populations (χ below 20%) neglecting mixing state leads to errors up to 150%, whereas for populations with χ larger than 75%, the error vanishes (Figure 3). To establish this relationship, Ching et al. [14] used particle-resolved simulations from a 0-D box model scenario library that represented a suite of idealized urban plume scenarios. Thus far, no studies have calculated spatial distributions of the mixing state parameter. This, however, is important for understanding where global models may need to take mixing state into account. The goal of this study is therefore to produce the first global distribution of the mixing state parameter χ. This will allow us to map out areas on the globe where low χ values can be expected; these are the areas where we expect large errors in CCN prediction when using a simplified aerosol model that does not resolve, or does not fully resolve, aerosol mixing state. Conversely, it is informative to delineate areas where the mixing state approaches an internal mixture, as for these areas assuming an internal mixture would be appropriate for CCN predictions.
As mentioned before, it is currently not feasible to directly run a particle-resolved aerosol model on a global scale, which would be needed to create a global map of χ directly. We therefore propose an approach that combines particle-resolved modeling and output from a global chemical transport model with machine learning techniques, as outlined in Figure 4. This involves the construction of a scenario library of particle-resolved simulations using PartMC-MOSAIC, which cover a wide range of conditions that are expected to be encountered in different environments around the globe. This dataset is then used to train a model of χ using machine learning techniques. Importantly, the features of this model are dictated by the list of variables that are known to the global scale model, in our case GEOS-Chem-TOMAS [17,18]. Many examples exist in the aerosol modeling literature where parameters for coarser models were derived on the basis of box model simulations that capture certain microphysical processes in detail [19]. However, the choice of the explanatory variables (features) and the fitting of the coarse model were typically done "by hand". This approach works well if the relevant parameter space is low-dimensional so that a few features can be identified that govern a certain process. In our case, there are many relevant variables that could potentially influence χ, hence machine learning methods represent an appropriate tool.

[Figure 4. Outline of the approach, which ends in a global map of χ.]
The remainder of the paper is structured as follows: Section 2 describes the tools and methods that are used in this study, including the mixing state metric χ, the particle-resolved aerosol model PartMC-MOSAIC, the dataset from the global model GEOS-Chem-TOMAS, the simulations that yield the training and testing dataset, and the machine learning methods. Section 3 presents the global maps of the mixing state parameter χ as obtained from the machine learning procedure. Section 4 presents our conclusions and provides a perspective for future work.

Mixing State Metric χ
We quantified aerosol mixing state with the framework described in Riemer and West [13], specifically using the mixing state metric χ. This was inspired by diversity metrics used in other disciplines such as ecology [20], economics [21], neuroscience [22], and genetics [23].
Given a population of N aerosol particles, each consisting of some amounts of A distinct aerosol species, the mixing state metrics can be determined if the masses of species a in particle i are known, denoted by $\mu_i^a$, for $i = 1, \ldots, N$ and $a = 1, \ldots, A$. From this quantity, all other related quantities can be calculated, as described by Riemer and West [13] and here listed in Table 1. The diversity metrics can then be constructed as summarized in Table 2.
Table 1. Aerosol mass and mass fraction definitions and notation, used to construct the diversity metrics shown in Table 2. The number of particles in the population is N, and the number of species is A. This table is taken from Riemer and West [13].

Quantity | Meaning
$\mu_i^a$ | Mass of species a in particle i
$\mu_i = \sum_{a=1}^{A} \mu_i^a$ | Total mass of particle i
$\mu^a = \sum_{i=1}^{N} \mu_i^a$ | Total mass of species a in population
$\mu = \sum_{i=1}^{N} \mu_i$ | Total mass of population
$p_i^a = \mu_i^a / \mu_i$ | Mass fraction of species a in particle i
$p^a = \mu^a / \mu$ | Mass fraction of species a in population

Table 2. Definitions of aerosol mixing entropies, particle diversities, and mixing state index. In these definitions, we take $0 \ln 0 = 0$ and $0^0 = 1$. This table is taken from Riemer and West [13].

Quantity | Name | Units | Range | Meaning
$H_i = -\sum_{a=1}^{A} p_i^a \ln p_i^a$ | Mixing entropy of particle i | - | 0 to ln A | Shannon entropy of species distribution within particle i
$H_\alpha = \sum_{i=1}^{N} (\mu_i / \mu) H_i$ | Average particle mixing entropy | - | 0 to ln A | Average Shannon entropy per particle
$H_\gamma = -\sum_{a=1}^{A} p^a \ln p^a$ | Population bulk mixing entropy | - | 0 to ln A | Shannon entropy of species distribution within population
$D_i = e^{H_i}$ | Particle diversity of particle i | Effective species | 1 to A | Effective number of species in particle i
$D_\alpha = e^{H_\alpha}$ | Average particle (alpha) species diversity | Effective species | 1 to A | Average effective number of species in each particle
$D_\gamma = e^{H_\gamma}$ | Bulk population (gamma) species diversity | Effective species | 1 to A | Effective number of species in the bulk
$\chi = (D_\alpha - 1)/(D_\gamma - 1)$ | Mixing state index | - | 0% to 100% | Degree to which population is externally mixed (χ = 0%) versus internally mixed (χ = 100%)

Based on the per-particle mass fractions, the particle diversity $D_i$ can be calculated, which can be interpreted as the number of "effective species" of particle i. For a particle consisting of A species, the particle diversity $D_i$ can be maximally A, which occurs when all A species are present in equal mass fractions. From the $D_i$ values of all particles, we can determine the population-level quantities $D_\alpha$ and $D_\gamma$, with $D_\alpha$ being the average effective number of species in each particle, and $D_\gamma$ being the effective number of species in the bulk. The mixing state index χ is defined as

$$\chi = \frac{D_\alpha - 1}{D_\gamma - 1}.$$

The mixing state index χ varies from 0% (a fully externally mixed population) to 100% (a fully internally mixed population). Since χ has the intuitive interpretation of the "degree of internal mixing", it can be used as a metric for error quantification, i.e., to determine the magnitude of error that is introduced in estimating aerosol impacts when neglecting mixing state information. This was shown by Ching et al. [14] for the example of CCN concentration, as illustrated in Figure 3.
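As a minimal sketch of this calculation, the following code evaluates $D_\alpha$, $D_\gamma$, and $\chi = (D_\alpha - 1)/(D_\gamma - 1)$ following the definitions of Riemer and West [13]. The function name and the three six-particle example populations are illustrative assumptions, not part of the original study; χ is returned as a fraction rather than a percentage.

```python
import math


def mixing_state_index(masses):
    """Return (D_alpha, D_gamma, chi) for per-particle species masses.

    masses[i][a] is the mass of species a in particle i (any consistent unit).
    chi is returned as a fraction in [0, 1]; multiply by 100 for percent.
    Particles are assumed to have positive total mass.
    """
    total = sum(sum(p) for p in masses)
    n_species = len(masses[0])

    def entropy(fractions):
        # Shannon entropy with the convention 0 ln 0 = 0.
        return -sum(f * math.log(f) for f in fractions if f > 0.0)

    # Average per-particle entropy, weighted by each particle's mass share.
    h_alpha = 0.0
    for p in masses:
        m_i = sum(p)
        h_alpha += (m_i / total) * entropy([m / m_i for m in p])

    # Entropy of the bulk composition.
    bulk = [sum(p[a] for p in masses) / total for a in range(n_species)]
    h_gamma = entropy(bulk)

    d_alpha, d_gamma = math.exp(h_alpha), math.exp(h_gamma)
    if d_gamma - 1.0 < 1e-12:
        raise ValueError("chi is undefined when the bulk has a single species")
    return d_alpha, d_gamma, (d_alpha - 1.0) / (d_gamma - 1.0)


# Hypothetical two-species populations with identical bulk composition.
external = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3  # every particle is pure
internal = [[0.5, 0.5]] * 6                      # identical 50/50 particles
partial = [[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3    # intermediate case

chi_external = mixing_state_index(external)[2]
chi_internal = mixing_state_index(internal)[2]
chi_partial = mixing_state_index(partial)[2]
```

The three populations share the same bulk composition, so $D_\gamma$ is the same in each case and only $D_\alpha$ changes, which is what moves χ between the external and internal limits.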
The definition of "species" for calculating the mass fractions depends on the application. It can refer to individual chemical species, as in the studies by Riemer and West [13], Healy et al. [15], O'Brien et al. [16], Giorio et al. [24], and Fraund et al. [25]. Alternatively, it can refer to species groups, as in Dickau et al. [26] who quantified mixing state with respect to volatile and non-volatile components. Since we are concerned with CCN properties in this paper, we will group the chemical model species according to hygroscopicity, defining two species groups. Black carbon (BC), primary organic aerosol (POA), and freshly emitted mineral dust are combined into one surrogate species, since their hygroscopicities are very low. All other model species (inorganic and secondary organic aerosol species) are combined into a second surrogate species. The mixing state index χ is calculated from these two surrogate species. Note that calculating χ based on the two surrogate species does not bias the value of χ in a systematic way compared to the value based on the individual chemical species. A χ value close to 0% can be interpreted as the hygroscopic and non-hygroscopic species existing in different particles, whereas a χ value close to 100% would correspond to an aerosol population where all particles contain the same amount of hygroscopic and non-hygroscopic species.
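The grouping into two surrogate species can be sketched as follows. The species labels, the assumption that mineral dust is carried under the label OIN, and the three example particles are hypothetical and serve only to illustrate the lumping step that precedes the calculation of χ.

```python
# Assumption: low-hygroscopicity species are labeled BC, POA, and OIN (dust).
NON_HYGROSCOPIC = {"BC", "POA", "OIN"}


def to_surrogates(particle):
    """Collapse a {species: mass} dict into [non_hygroscopic, hygroscopic]."""
    low = sum(m for s, m in particle.items() if s in NON_HYGROSCOPIC)
    high = sum(m for s, m in particle.items() if s not in NON_HYGROSCOPIC)
    return [low, high]


# Hypothetical particles: fresh soot, a secondary particle, and aged soot.
population = [
    {"BC": 2.0, "POA": 1.0},
    {"SO4": 1.5, "NH4": 0.5, "SOA": 1.0},
    {"BC": 0.5, "SO4": 2.0, "NO3": 0.5},
]
surrogate_masses = [to_surrogates(p) for p in population]
```

The resulting two-column mass list is the input from which a two-species χ would then be computed.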

Particle-Resolved Aerosol Modeling
A detailed model description of the stochastic particle-resolved aerosol model PartMC-MOSAIC is provided by Riemer et al. [4]. In summary, PartMC (Particle-resolved Monte Carlo) is a zero-dimensional aerosol model, which explicitly tracks the composition of many individual particles within a well-mixed computational volume. This computational volume is assumed to be representative of a much larger air parcel within the planetary boundary layer. The processes of emission, dilution with the background, and Brownian coagulation are simulated with a stochastic Monte Carlo approach. To improve the efficiency of the method, we use weighted particles in the sense of DeVille et al. [27] and efficient stochastic sampling methods [28].
PartMC is coupled with the aerosol chemistry model MOSAIC (Model for Simulating Aerosol Interactions and Chemistry) [29]. This includes the gas phase photochemical mechanism CBM-Z [30], the Multicomponent Taylor Expansion Method (MTEM) for estimating activity coefficients of electrolytes and ions in aqueous solutions [31], the multi-component equilibrium solver for aerosols (MESA) for solid-liquid partitioning within particles [32], and the adaptive step time-split Euler method (ASTEM) for dynamic gas-particle partitioning over the size- and composition-resolved aerosol [29]. To simulate secondary organic aerosol (SOA), the SORGAM scheme is used [33]. The CBM-Z gas phase mechanism includes 77 gas species. MOSAIC treats key aerosol species including sulfate (SO4), nitrate (NO3), ammonium (NH4), chloride (Cl), carbonate (CO3), methanesulfonic acid (MSA), sodium (Na), calcium (Ca), other inorganic mass (OIN), BC, POA, and SOA. The model species OIN represents species such as SiO2, metal oxides, and other unmeasured or unknown inorganic species. Our SOA model species include reaction products of aromatic precursors, higher alkenes, α-pinene, and limonene. In this study, PartMC includes condensation/evaporation of vapors to/from particles and coagulation between particles. It does not include nucleation in this study; the resulting limitations on our results are discussed throughout.
PartMC-MOSAIC has been used in the past for process studies of mixing state impacts on aerosol properties in various environments. For example, Tian et al. [34] investigated the aging of aerosol particles in a ship plume. Ching et al. [12] quantified the response of cloud droplet number concentration to changes in emissions of black-carbon-containing particles, and Mena et al. [35] carried out plume-exit modeling to determine cloud condensation nuclei activity of aerosols from residential biofuel combustion.

GEOS-Chem-TOMAS Dataset
To provide initial concentrations of gas-phase and size-resolved aerosol-phase species in a large-scale global model, we use the Goddard Earth Observing System chemical-transport model, GEOS-Chem, version 10.01 [36] (http://acmg.seas.harvard.edu/geos/) coupled with the TwO Moment Aerosol Sectional (TOMAS) microphysics scheme [17]. We simulated the year 2010 with re-analysis meteorology fields from GEOS5 (http://gmao.gsfc.nasa.gov). Simulations used a horizontal resolution of 2° × 2.5° and 47 vertical layers. GEOS-Chem includes tracers for 52 gas-phase species. The standard emission setup is described in the study by Kodros et al. [18]. We used the 15-bin version of TOMAS, with size sections ranging from approximately 3 nm to 10 µm. TOMAS includes tracers for aerosol number concentration, sulfate, organic aerosol, black carbon, sea salt, and dust. Nucleation in the simulations follows a ternary nucleation scheme involving water, sulfuric acid, and ammonia following the parameterization of Napari et al. [37], scaled with a global tuning factor of $10^{-5}$ [38,39]. When ammonia mixing ratios are less than 1 pptv, the model defaults to a binary nucleation scheme (sulfuric acid and water) [40]. Detailed descriptions of aerosol microphysics included in TOMAS can be found in Adams and Seinfeld [17], Lee et al. [41], and Lee and Adams [42]. GEOS-Chem-TOMAS has been evaluated against observed aerosol size distributions [43,44].

Design of the Training and the Testing Scenarios
At the core of the machine learning framework is the design of a scenario library of particle-resolved simulations to create a large number of aerosol populations with different compositions and different mixing states. Scenario libraries that we developed in previous work [10,12,45] focused on urban environments, and in particular on the aging process of carbonaceous aerosol by coagulation and condensation of secondary aerosol. Here, we expanded the list of aerosol types by including sea salt aerosol and dust emissions.
We did not include the process of particle nucleation in this set of training simulations because there are still significant uncertainties about the treatment of particle-level post-nucleation growth mechanisms [46]. The lack of nucleation in our training library can be expected to introduce errors into our global mixing state predictions in the smaller size bins where particles may be influenced by nucleation and growth. In particular, we expect that true χ values in the Aitken and accumulation modes will generally be lower than our predicted values in areas with pre-existing non-hygroscopic particles (e.g., from combustion) where significant nucleation occurs, because freshly nucleated particles will then create a more externally mixed population.
All scenarios used a simulation time of 24 h, starting at 6:00 a.m. local time, with output being saved every 10 min. We used 10,000 computational particles for each simulation. The initial conditions for aerosol and gas phase were the same for all scenarios and are identical to Zaveri et al. [8]. Specifically, the aerosol initial condition consisted of an Aitken and an accumulation mode with internally mixed ammonium sulfate, secondary organic aerosol, and trace amounts of black carbon, as listed in Table 3. Although the initial conditions were fixed in these scenarios, these particles generally evolved substantially over the course of the simulations. However, we cannot rule out that this choice influenced our results, and we will address this in future work by introducing more variability to the design of the initial condition. Twenty-five input parameters were varied between scenarios to represent a range of environmental conditions with different levels of gas phase emissions and emissions of primary aerosol particles to allow for large variations in the mixing state evolution. Latin hypercube sampling [47] was used to provide an efficient sampling across this high-dimensional space. The details of our setup are listed in Table 4. The input parameter space was sampled so that the resulting distributions of simulated variables, such as gas phase and bulk aerosol concentrations, were similar to the corresponding distributions in the output data of GEOS-Chem-TOMAS. The distributions need not be identical, but they must be similar enough that the model that is trained from the PartMC library is not required to extrapolate far outside the parameter range on which it was trained.
Table 4. List of input parameters and their sampling ranges and procedures to construct the scenario library. See the main text for details.
The temperature was then uniformly sampled from a range of T(φ, m) ± 3σ(φ, m) if 3σ > 8 K, or T(φ, m) ± 8 K otherwise. For simplicity, the sampled temperature was kept constant for the duration of the 24-h simulation.
(3) The emission fluxes of aerosol and gases were sampled from a non-uniform distribution by multiplying the maximum emission rate with a random number between 0 and 1 raised to the fourth power. This ensured that our sampling space was skewed towards the lower emission rates, while still retaining some scenarios that represent highly polluted conditions.
(4) The aerosol distributions for the emitted carbonaceous particles were prescribed as log-normal, with geometric mean diameter $D_g$ and geometric standard deviation $\sigma_g$.
(5) Sea salt particles were emitted wet rather than dry. For composition, a simplified mixture of 53.89% $\mathrm{Cl^-}$, 38.56% $\mathrm{Na^+}$, and 7.55% $\mathrm{SO_4^{2-}}$ by mass was used, based on the mass ratio of $\mathrm{Cl^-}$ to $\mathrm{SO_4^{2-}}$ of 7.15 in seawater ([49], p. 384) and adding enough $\mathrm{Na^+}$ to balance the charges. Additionally, because organic species are a substantial but variable component of sea salt aerosols (Vignati et al. [50]), a variable amount of OC was added, making up 0% to 20% of the mass of the particles. One third of all scenarios had no sea salt emissions.
(6) One third of all scenarios had no dust emissions.
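The sampling rules above can be sketched in a few lines. This is an illustration under stated assumptions rather than the procedure used to build the library: the Latin hypercube generator is a basic stratified implementation, the way it is combined with the fourth-power and temperature transforms is our assumption, and the three parameters, their names, and their values are hypothetical stand-ins for the 25 parameters of Table 4.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in [0, 1)^n_dims with one point per stratum."""
    columns = []
    for _ in range(n_dims):
        # One jittered point in each of n_samples equal-width strata ...
        col = [(k + rng.random()) / n_samples for k in range(n_samples)]
        rng.shuffle(col)  # ... then decouple the strata across dimensions.
        columns.append(col)
    return [list(row) for row in zip(*columns)]


def emission_rate(u, max_rate):
    """Skew towards low emissions: maximum rate times u**4, u in [0, 1)."""
    return max_rate * u ** 4


def temperature(u, t_mean, sigma):
    """Uniform draw from t_mean +/- 3*sigma if 3*sigma > 8 K, else +/- 8 K."""
    half_width = 3.0 * sigma if 3.0 * sigma > 8.0 else 8.0
    return t_mean + (2.0 * u - 1.0) * half_width


rng = random.Random(0)
unit = latin_hypercube(n_samples=1000, n_dims=3, rng=rng)
scenarios = [
    {
        "bc_emission": emission_rate(u[0], max_rate=1.0),   # hypothetical units
        "so2_emission": emission_rate(u[1], max_rate=1.0),  # hypothetical units
        "temperature_K": temperature(u[2], t_mean=288.0, sigma=2.0),
    }
    for u in unit
]
```

With the fourth-power transform, half of the sampled emission rates fall below about 6% of the maximum rate (0.5 to the fourth power is 0.0625), which reflects the intended skew towards cleaner conditions.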
A total of 1000 scenarios were created in this fashion to make up the training library. Since we save the output every 10 min of each 24-h simulation, this yields 144,000 particle populations for our training dataset. For testing purposes, a second library of 240 scenarios (34,560 populations) was created in the same manner to gauge the accuracy of the model, using the same distributions, but with different combinations of parameters. This provides a check against overfitting, in which the model that is learned has been fit to the stochastic noise in the training set, resulting in poor predictive performance for any other data set.

Machine Learning as Applied to PartMC
Machine learning refers to a variety of algorithms that are used to identify and model patterns in large datasets, and then use these models to make predictions. It has proven to be a diverse set of tools in the atmospheric sciences. Past applications have included interpreting remote sensing data [51], estimating uncertainty in aerosol optical depth data [52], prediction of aerosol-induced health impacts [53], and forecasting solar radiation for energy generation [54].
Our model predicts χ in a single global-model grid cell, given inputs of the GEOS-Chem-TOMAS variables in that grid cell. We present two variants of this model, one that predicts χ for the bulk aerosol population, and one that predicts χ for each size bin of the global model. A total of 34 input feature variables were used, including gas concentrations, aerosol mass concentrations, aerosol number concentration, solar zenith angle, and latitude. Note that the mass concentrations of the different aerosol species are not lumped into hygroscopic and non-hygroscopic species for this purpose, but are used individually. MOSAIC species were mapped to TOMAS species when training the model. At each horizontal location, we computed the average predicted χ over grid layers up to 840 mb.
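The vertical averaging step can be sketched as below. We assume here that "up to 840 mb" means all layers between the surface and the 840 mb level (pressure of 840 mb or higher) and that the average is unweighted; neither detail is specified in the text, and the layer pressures and χ values are hypothetical.

```python
def column_mean_chi(pressures_mb, chi_layers, p_top_mb=840.0):
    """Unweighted mean of chi over layers at or below the p_top_mb level.

    Layers where chi was not determined are passed as None and skipped.
    """
    selected = [
        c for p, c in zip(pressures_mb, chi_layers)
        if p >= p_top_mb and c is not None
    ]
    return sum(selected) / len(selected) if selected else None


pressures = [1000.0, 950.0, 900.0, 850.0, 800.0, 700.0]  # hypothetical layers
chi_profile = [60.0, 62.0, 64.0, 70.0, 80.0, 90.0]       # hypothetical chi in %
column_chi = column_mean_chi(pressures, chi_profile)
```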
We used gradient-boosted regression trees ([55], Chapter 10) as the machine-learning algorithm for this study, because this is a well-understood algorithm that offers good predictive accuracy with moderate computational cost and is able to perform automatic feature selection during training. Gradient boosting methods [56,57] form a prediction model as a sequence of weak prediction models, each of which fits the residual of the previous predictors in the sequence and thus serves to slightly improve the overall prediction accuracy.
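To make the residual-fitting idea concrete, the following from-scratch sketch implements least-squares gradient boosting with single-split trees ("stumps") on one input variable. It is a simplified stand-in for illustration only: the study used a library implementation with deeper trees and many features, and the function names, learning rate, number of trees, and synthetic data here are assumptions.

```python
def fit_stump(x, residual):
    """Best single-split regression tree (a stump) on 1-D inputs."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residual) if xi <= threshold]
        right = [r for xi, r in zip(x, residual) if xi > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda xi: lmean if xi <= threshold else rmean


def gradient_boost(x, y, n_trees=50, learning_rate=0.1):
    """Least-squares boosting: each stump fits the current residual."""
    base = sum(y) / len(y)
    trees = []
    prediction = [base] * len(y)
    for _ in range(n_trees):
        residual = [yi - pi for yi, pi in zip(y, prediction)]
        tree = fit_stump(x, residual)
        trees.append(tree)
        prediction = [pi + learning_rate * tree(xi)
                      for xi, pi in zip(x, prediction)]
    return lambda xi: base + learning_rate * sum(t(xi) for t in trees)


# Synthetic 1-D example: learn y = x^2 on [0, 1].
x_train = [i / 20 for i in range(21)]
y_train = [xi ** 2 for xi in x_train]
model = gradient_boost(x_train, y_train)

mean_y = sum(y_train) / len(y_train)
mse_base = sum((yi - mean_y) ** 2 for yi in y_train) / len(y_train)
mse_model = sum((yi - model(xi)) ** 2
                for xi, yi in zip(x_train, y_train)) / len(y_train)
```

Each added stump removes part of the remaining residual, so the training error of the sequence decreases with every round; the small learning rate is what makes each individual improvement slight.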
For gradient-boosted regression trees, the weak prediction models are regression trees ([55], Section 9.2.2), which predict an output value as a tree of decisions on input values. For example, a single depth-2 regression tree for χ might have a first decision of "(latitude > 50°)?", and if this is true it might have a second decision of "([SO2] < 30 ppb)?", and if this is false then it outputs χ = 0.8. A depth-n tree allows up to n-way interactions between feature variables.
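Written as code, such a depth-2 tree is simply a pair of nested decisions. Only the latitude split, the SO2 split, and the χ = 0.8 leaf are taken from the example in the text; the remaining split and leaf values are hypothetical placeholders.

```python
def toy_tree(latitude_deg, so2_ppb):
    """Depth-2 regression tree for chi; only the 0.8 leaf is from the text."""
    if latitude_deg > 50.0:          # first decision
        if so2_ppb < 30.0:           # second decision
            return 0.6               # hypothetical leaf value
        return 0.8                   # leaf described in the text
    if so2_ppb < 10.0:               # hypothetical second-level split
        return 0.4                   # hypothetical leaf value
    return 0.9                       # hypothetical leaf value


chi_example = toy_tree(latitude_deg=60.0, so2_ppb=40.0)
```

Because the second decision is only reached through the first, the output depends on latitude and SO2 jointly, which is the two-way interaction that a depth-2 tree can represent.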
We used the implementation of gradient-boosted regression trees from scikit-learn [58]. The model was trained on the training data set and then its performance was evaluated on the testing data set (see Section 2.4). We used a least-squares loss function and all of our gradient-boosted models used 400 decision trees as submodels, as this was sufficient to obtain the best performance on the testing data set. We tested different tree depths, as shown in Figure 5 (left). Similar to many applications ([55], Chapter 10), we found that tree depths between 4 and 8 worked well, and we used depth 8 for the final model used in the remainder of the paper to give good prediction accuracy with reasonable computational speed.
The performance of our final model is shown in Figure 5 (right). In this figure, a perfect model would be the red 1:1 line. Our model has $R^2$ = 0.94 and a mean error of 1.67%. The maximum error for any testing scenario is 13.02%.
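For reference, the following sketch shows how such summary statistics can be computed from paired true and predicted χ values. We assume here that the errors are absolute differences in percentage points and that they are evaluated per population; the exact definitions and any per-scenario aggregation used for the reported numbers are not specified in the text, and the four example values are hypothetical.

```python
def evaluate(chi_true, chi_pred):
    """Return R^2, mean absolute error, and maximum absolute error."""
    n = len(chi_true)
    mean_true = sum(chi_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(chi_true, chi_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in chi_true)
    abs_err = [abs(t - p) for t, p in zip(chi_true, chi_pred)]
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mean_abs_error": sum(abs_err) / n,
        "max_abs_error": max(abs_err),
    }


chi_true = [30.0, 50.0, 70.0, 90.0]  # hypothetical particle-resolved chi (%)
chi_pred = [32.0, 49.0, 71.0, 86.0]  # hypothetical model predictions (%)
scores = evaluate(chi_true, chi_pred)
```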

Predicting χ for the Bulk Aerosol Population
Using output from GEOS-Chem-TOMAS and the model for χ that was trained on particle-resolved data, we can now produce global distributions of χ. Figure 6 shows examples of such distributions using six-hourly output from GEOS-Chem-TOMAS and comparing two different dates, 06:00 UTC on 1 January 2010 and on 1 June 2010. Note that χ was calculated based on the entire size range of aerosol particles and hence if coarse-mode particles and fine-mode particles have different compositions, this would result in a lower χ value (more externally mixed), even if the coarse and fine modes each had higher χ values (more internally mixed). Because χ is a mass-weighted quantity, the χ values for all sizes are dominated by the coarse mode mixing state.
We determined χ only for grid cells that contained between 5% and 95% hygroscopic material, hence excluding areas where essentially only one surrogate species (either hygroscopic or non-hygroscopic) was present. We see from Figure 6 that these excluded areas cover much of the oceans, and much of the Sahara and other deserts. This exclusion is because it is meaningless to discuss the mixing state between hygroscopic and non-hygroscopic material when there is essentially only a single type present. For both dates, the predicted χ varied from 30% to 97%. High χ values existed over industrial source regions including East China, India, and the Eastern and Midwestern United States, with χ approaching 100%. This result can be interpreted to mean that in these regions non-hygroscopic (mainly freshly emitted carbonaceous aerosol) and hygroscopic (mainly secondary) aerosol species are mixed together within the same particle. This prediction is consistent with the fact that highly polluted areas have extremely short aging timescales for carbonaceous emissions [9,59], and so, at least on the scale of the grid resolution used here in GEOS-Chem-TOMAS, assuming an internal mixture of non-hygroscopic and hygroscopic species is appropriate. However, we note that nucleation is frequently observed in many of these regions, and hence our training data that omitted nucleation may be overestimating χ in some of these regions.
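The exclusion criterion can be sketched as a simple mask over grid cells. Whether the 5% and 95% bounds are inclusive is our assumption, and the example fractions and χ values are hypothetical.

```python
def chi_where_defined(hygroscopic_fraction, chi_predicted, low=0.05, high=0.95):
    """Keep chi only where the hygroscopic mass fraction lies in [low, high].

    Cells dominated by a single surrogate species are returned as None.
    """
    return [
        c if low <= f <= high else None
        for f, c in zip(hygroscopic_fraction, chi_predicted)
    ]


fractions = [0.02, 0.50, 0.97, 0.95]  # hypothetical: desert, continent, ocean, coast
chi_cells = [40.0, 55.0, 90.0, 80.0]  # hypothetical predicted chi (%)
masked = chi_where_defined(fractions, chi_cells)
```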
Plumes of aerosol with relatively high χ values of around 80% can also be seen to be transported over the oceans in the outflow of continents, e.g., east of China.This was more prominent for 1 June over the Northern Hemisphere, which is consistent with a larger availability of photochemically produced secondary species that can condense on the originally non-hygroscopic carbonaceous particles, thereby moving the population towards a more internal mixture.

Predicting χ for Individual Size Bins
Rather than including the entire PartMC particle populations for the machine-learning process, we can also group the PartMC output according to particle size first, and then train a separate model for χ for each individual size category. This altered the input feature variables for the model from bulk aerosol mass concentrations and total number concentration to the aerosol mass concentrations and number concentration within the size range.
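The grouping step can be sketched as follows: particles are first assigned to size bins by diameter, and χ is then computed separately for each bin using the definition of Riemer and West [13]. The bin edges, diameters, and masses are hypothetical; in the study the TOMAS size bins would take the place of the two bins used here, and all particles are assumed to have positive mass.

```python
import math
from bisect import bisect_right


def chi(masses):
    """Mixing state index (fraction) for masses[i][a]; None if undefined."""
    total = sum(map(sum, masses))

    def entropy(fractions):
        # Shannon entropy with the convention 0 ln 0 = 0.
        return -sum(f * math.log(f) for f in fractions if f > 0.0)

    h_alpha = sum((sum(p) / total) * entropy([m / sum(p) for m in p])
                  for p in masses)
    n_species = len(masses[0])
    h_gamma = entropy([sum(p[a] for p in masses) / total
                       for a in range(n_species)])
    d_gamma = math.exp(h_gamma)
    if d_gamma - 1.0 < 1e-12:
        return None  # only one species present in this bin
    return (math.exp(h_alpha) - 1.0) / (d_gamma - 1.0)


def chi_by_size_bin(diameters_nm, masses, bin_edges_nm):
    """Group particles by diameter bin, then compute chi within each bin."""
    groups = {}
    for d, p in zip(diameters_nm, masses):
        k = bisect_right(bin_edges_nm, d) - 1
        if 0 <= k < len(bin_edges_nm) - 1:
            groups.setdefault(k, []).append(p)
    return {k: chi(group) for k, group in sorted(groups.items())}


edges = [10.0, 100.0, 1000.0]            # hypothetical bin edges in nm
diameters = [50.0, 60.0, 500.0, 600.0]   # hypothetical particle diameters
particles = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5]]
chi_bins = chi_by_size_bin(diameters, particles, edges)
```

In this example the small bin is externally mixed and the large bin is internally mixed, which illustrates how a size-resolved χ can differ strongly from a single bulk value.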
Choosing the TOMAS size bins, we obtained results for the testing data set, as shown in Table 5. The $R^2$ values are generally lower than for the case without size resolution, which is expected since for each size bin a smaller set of particles is available for learning the model. In fact, for size bins 1-6 (corresponding to dry diameters from ∼3-30 nm), the $R^2$ values were very low, so that we only discuss the results for size bins 7 and larger (dry diameters above ∼30 nm). In future work, we plan to refine these results by increasing the particle samples in the smaller size bins. Figure 7 shows the global maps of size-resolved χ, based on GEOS-Chem-TOMAS output fields averaged for the months of January and July for size bin 8 ($\chi_8$, bin median diameter of ∼90 nm) and size bin 14 ($\chi_{14}$, bin median diameter of ∼2 µm). Other months had very similar distributions and are not shown. The distribution for $\chi_8$ shows low values of approximately 20% in the Amazon basin, central Africa, and Indonesia. These are areas with a large contribution of carbonaceous aerosol from biomass burning. The low $\chi_8$ value in this size range means that the carbonaceous material is externally mixed from other (more hygroscopic) aerosol in these areas. In contrast, internally mixed aerosol is predicted for East Asia and India. For January, plumes with internally mixed aerosol extend from India into the Arabian Sea (winter Monsoon), while, for July, this is not the case (summer Monsoon).
Due to the setup of our scenario library, we need to be aware of some biases that we might introduce with our choices. By using the same initial condition for all simulations, we may underestimate χ in locations where the local emissions are relatively small but different from the initial conditions and where the conditions are not conducive to secondary aerosol formation. Conversely, the χ values for 90-nm particles in the polluted regions (Eastern US, India, Europe, and China) are likely biased high, since our scenario library does not include nucleation, as mentioned in Section 2.4. Nucleation events and growth to 90 nm are routinely observed in these areas, along with primary carbonaceous emissions at these sizes, which may be fresh or aged. Overall, nucleation events are likely to decrease χ in this size range, since a more external mixture would be created. We plan to quantify the impacts of both the initial condition choice and nucleation on the machine learning procedure in future work.
Figure 8 shows the composition of the aerosol in these two size bins as a fraction of hygroscopic species. This figure confirms that for large areas over the oceans the aerosol consists of only hygroscopic material, and the 2 µm bin over the desert areas contains only non-hygroscopic material, which is why χ was not determined for those areas. Figure 9 shows the size-resolved χ for selected regions, which are indicated in Figure 7 as colored boxes. This figure confirms the strong size dependence for the Amazon region, while the other regions do not show a pronounced size dependence. Differences between summer and winter are noticeable for North East China (χ is higher in July), Sahara (χ is higher in July), and Central India (χ is lower in July). A possible explanation for the higher χ values in summer for North East China and the Sahara is a generally larger production of condensable gases during the Northern Hemisphere summer, which helps create a more internal mixture. The lower χ values over India during summer might be related to the Monsoon, which removes both condensable gases and aged aerosol. It is interesting to note that the smallest χ value is about 20%, so the classic "external mixture" with χ approaching zero is not found anywhere in these examples. We want to emphasize again that the χ values for smaller sizes in polluted regions such as North East China and Central India may be overestimated because our training data set does not include the process of nucleation. Including nucleation is generally expected to create a more external mixture, since freshly nucleated (hygroscopic) particles would co-exist with carbonaceous particles in these environments.

Conclusions
This paper presents the first estimate of the spatial distribution of aerosol mixing state over the globe, as quantified by the mixing state metric χ. We defined this metric to estimate the degree to which hygroscopic and non-hygroscopic species are mixed on a per-particle basis, with χ = 0% being completely externally mixed and χ = 100% being completely internally mixed. We obtained this global estimate by training a machine-learning model of χ on detailed particle-resolved box model data, and then applying the model to GCM output to predict χ globally.
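To make the two limiting cases concrete, the following minimal sketch computes χ for a small particle population with two surrogate species (hygroscopic and non-hygroscopic). It assumes the entropy-based definition of the metric, in which χ is the ratio of the average per-particle species diversity minus one to the bulk species diversity minus one. The function name `mixing_state_index` and the list-of-masses input layout are hypothetical choices made for illustration and are not taken from the study's code.

```python
import math


def mixing_state_index(particles):
    """Return chi in [0, 1] for a list of per-particle species masses.

    particles: list of [m_hygroscopic, m_non_hygroscopic] (hypothetical layout).
    Returns None when chi is undefined (no mass, or only one species in the bulk).
    Assumes the entropy-based definition chi = (D_alpha - 1) / (D_gamma - 1).
    """
    total_mass = sum(sum(p) for p in particles)
    if total_mass <= 0.0:
        return None
    n_species = len(particles[0])

    def entropy(fractions):
        # Shannon entropy; species with zero mass fraction contribute nothing.
        return -sum(f * math.log(f) for f in fractions if f > 0.0)

    # Mass-weighted average of the per-particle mixing entropies.
    h_alpha = 0.0
    for p in particles:
        m_i = sum(p)
        if m_i > 0.0:
            h_alpha += (m_i / total_mass) * entropy([m / m_i for m in p])

    # Mixing entropy of the bulk population.
    bulk = [sum(p[a] for p in particles) / total_mass for a in range(n_species)]
    h_gamma = entropy(bulk)

    d_alpha = math.exp(h_alpha)  # average per-particle diversity
    d_gamma = math.exp(h_gamma)  # bulk diversity
    if math.isclose(d_gamma, 1.0):
        return None  # a single bulk species: chi is not defined
    return (d_alpha - 1.0) / (d_gamma - 1.0)


external = [[1.0, 0.0], [0.0, 1.0]]   # each particle holds one species
internal = [[1.0, 1.0], [2.0, 2.0]]   # every particle has the bulk composition
print(mixing_state_index(external), mixing_state_index(internal))
```

The `None` return value mirrors the situation in which a population contains only hygroscopic or only non-hygroscopic material, so that no meaningful value of χ can be assigned.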
In some parts of the globe, the aerosol appeared to be quite externally mixed, with χ values as low as 20%, suggesting that an external-mixing assumption is likely to be valid there. This was the case for the size range below 150 nm in regions where biomass burning aerosol dominated, such as the Amazon Basin, Central Africa, and Indonesia. In contrast, the mixing state index χ reached values of 90% for polluted regions in East Asia in July, indicating that an internal-mixture assumption is appropriate for those regions, at least for the spatial resolution of the GCM that was used here. In much of the globe, however, the aerosol mixing state was not clearly internally or externally mixed, which may indicate that assuming either limiting case could lead to significant errors. Previous work by Ching et al. [14] can be used to link the global maps of χ values from this study with estimated errors for CCN concentrations. For the χ values between 20% and 100% found in this study, assuming an internal mixture would introduce an overestimation in CCN concentrations of up to 50%, with the error decreasing to a few percent for χ larger than 80%. For χ values lower than 20%, errors in CCN concentration of up to 100% can occur, but these χ values did not occur in our study. The scenarios in the study by Ching et al. [14] were focused on the aging of carbonaceous aerosol and therefore did not encompass the full range of conditions that might be encountered around the globe. Nevertheless, they provide guidance on how the predicted distribution of χ values relates to expected errors in CCN predictions when assuming an internal mixture.
While the methodology used in this paper is effective at extrapolating high-detail simulation output to the global scale, it is important to understand the limitations of such a method. Roughly speaking, our model takes the GCM output variables in each grid cell and infers the mixing state χ value from particle-resolved box model simulations with similar corresponding state variables. This could deliver inaccurate χ estimates if there are no similar box model scenarios; if there are multiple box model scenarios that differ significantly in their χ predictions; if the comparison is inexact due to differences in the microphysics/chemistry models between the GCM and PartMC-MOSAIC; or if the matching box model scenarios had significantly different histories and therefore have misleading mixing states. For example, the lack of nucleation in the box model scenarios may well lead to somewhat overpredicted χ values in the sizes up to 90 nm in polluted regions. Additionally, we assumed a composition for the pre-existing particles in our training simulations, which may influence the results presented here.
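The first of these failure modes, a grid cell with no similar box model scenario, can be illustrated with a simple coverage check. The sketch below is not part of the method used in this study, which relies on a gradient boosted regression tree model; it swaps in a plain nearest-neighbor distance purely as an illustration. The feature choice (temperature and relative humidity), the scaling values, and the threshold are all hypothetical.

```python
import math


def nearest_scenario_distance(cell, scenarios, scales):
    """Smallest scaled Euclidean distance from one grid cell to any scenario.

    cell, scenarios[i]: feature vectors in the same order (hypothetical features).
    scales: one positive scale per feature, so that features are comparable.
    """
    return min(
        math.sqrt(sum(((c - s) / w) ** 2 for c, s, w in zip(cell, scen, scales)))
        for scen in scenarios
    )


def flag_extrapolation(cells, scenarios, scales, threshold=1.0):
    """Return True for each grid cell that has no training scenario nearby."""
    return [
        nearest_scenario_distance(cell, scenarios, scales) > threshold
        for cell in cells
    ]


# Illustrative values only: [temperature in K, relative humidity as a fraction].
training_scenarios = [[280.0, 0.5], [300.0, 0.8]]
grid_cells = [[281.0, 0.52], [250.0, 0.1]]
feature_scales = [10.0, 0.1]

print(flag_extrapolation(grid_cells, training_scenarios, feature_scales))
```

Grid cells flagged in this way would be candidates for extending the scenario library, since predictions there rest on extrapolation rather than on comparable training data.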
An important issue that should be addressed in future work is the question of end-to-end verification and validation of the χ predictions. This could be accomplished by performing single-particle measurements in different locations, similar to what has been done by Healy et al. [15] for a single location in Paris during the MEGAPOLI campaign. Another possibility would be to perform particle-resolved aerosol simulations within a 3D chemical transport model (at great computational expense) to calculate χ directly over small regions, and to compare these explicitly calculated χ values to χ predicted with machine learning.
It will be straightforward to adapt our model training to predict χ based on aerosol optical properties, rather than hygroscopicity. This would answer the question of how absorbing and non-absorbing aerosol species are mixed on a per-particle basis, which is important to capture the absorption enhancement of black-carbon-containing aerosol [60,61]. The approach presented in this paper could be generalized to other problems where particle-scale processes cannot directly be simulated within the large-scale modeling framework, but for which accurate small-scale models exist.

Figure 1. Schematic of aerosol mixing states for four different aerosol populations that have the same bulk composition. The blue and red colors represent aerosol species with different hygroscopicity: (a) fully external mixture; (b,c) intermediate mixing states; and (d) internal mixture. The mixing state metric χ measures the degree of internal mixing, ranging from 0% to 100%.

Figure 2. Global distribution of fraction of hygroscopic species as simulated by GEOS-Chem-TOMAS for the month of January for particles of ∼358 nm.

Figure 3. Relative error in CCN concentration when neglecting aerosol mixing state as a function of aerosol mixing state index χ. Each dot represents an aerosol population from Ching et al. [14]. CCN concentration was evaluated at a supersaturation of 0.6%.

Figure 4. Schematic of the learning architecture used to train, test, and use the machine-learning model.

Figure 5. (Left) Mean error in the predicted χ values from the testing data set as a function of tree depth for the gradient boosted regression tree model; and (Right) true χ values versus model-predicted values for our final model (corresponding to depth 8 in the left panel).

Figure 6. Global distribution of χ from the machine-learning model, at 06:00 UTC on January 1 (left), and June 1 (right), 2010. The model used to predict χ here is trained on the PartMC output that includes the entire particle population.

Figure 7. Global distribution of size-resolved χ values from the machine-learning model based on GEOS-Chem-TOMAS inputs for the months of: January (top); and July (bottom). (Left) χ for size bin 8, bin median diameter is 89.4 nm. (Right) χ for size bin 14, bin median diameter is 2024 nm. The colored boxes show the regions over which data were averaged for display in Figure 9.

Figure 8. Global distribution of fraction of hygroscopic species as simulated by GEOS-Chem-TOMAS for the months of: January (top); and July (bottom). (Left) Size bin 8, bin median diameter is 89.4 nm. (Right) Size bin 14, bin median diameter is 2024 nm.

Figure 9. Size-resolved values of χ for selected regions in: January (left); and July (right). See Figure 7 for the location of these regions.

Table 3. Number concentration, N_a, of the initial aerosol population. The aerosol size distributions are assumed to be lognormal and defined by the geometric mean diameter, D_g, and the geometric standard deviation, σ_g.

Table 5. Error statistics for the prediction of size-resolved χ.