A Novel Robust Method for Solving CMB Receptor Model Based on Enhanced Sampling Monte Carlo Simulation

The traditional effective variance weighted least squares algorithms for solving CMB (Chemical Mass Balance) models have the following drawbacks: When there is collinearity among the sources or the number of species is less than the number of sources, then some negative value of contribution will appear in the results of the source apportionment or the algorithm does not converge to calculation. In this paper, a novel robust algorithm based on enhanced sampling Monte Carlo simulation and effective variance weighted least squares (ESMC-CMB) is proposed, which overcomes the above weaknesses. In the following practical instances for source apportionment, when nine species and nine sources, with no collinearity among them, are selected, EPA-CMB8.2 (U.S. Environmental Protection Agency-CMB8.2), NKCMB1.0 (NanKai University, China-CMB1.0) and ESMC-CMB can obtain similar results. When the source raise dust is added to the source profiles, or nine sources and eight species are selected, EPA-CMB8.2 and NKCMB1.0 cannot solve the model, but the proposed ESMC-CMB algorithm can achieve satisfactory results that fully verify the robustness and effectiveness of ESMC-CMB.


Introduction
Atmospheric particulate matter (PM10 and PM2.5, with diameters less than 10 µm and 2.5 µm) is a mixture of solid or liquid particles suspended in the air, and is an important air pollutant in urban environments [1][2][3].Epidemiological studies have shown that PM2.5/PM10 and an increase in respiratory symptoms, lung cancer mortality, and cardiovascular disease are closely related [4][5][6][7][8][9][10].China is one of the countries with the most serious PM2.5 pollution in the world.In recent years, a total of 28 provinces and cities have reported heavy PM2.5 pollution phenomena; on average, each province has an annual total of nearly 20 days of heavy pollution.
At present, haze is frequent in China, affecting a wide range and having a long duration, which causes inconvenience to public life, threatens human health, and causes great concern for society and the government.Understanding and clarifying the potential sources and their contributions of PM2.5 is important [4].The work of source apportionment of PM2.5 has become one of the core strategies in the prevention and control of atmospheric pollution.
The CMB (Chemical Mass Balance) air quality model [5,6] is the most important model of atmospheric particulate matter source apportionment technology [7], recommended by the United States' EPA (Environmental Protection Agency), mainly used to study the TSP (Total Suspended Particulate), PM2.5, PM10, and VOC (Volatile Organic Compounds) as well as other sources of pollutants and their contribution.CMB receptor models are established according to the principle of mass balance, and the chemical concentration of pollutants can be expressed by the sum of the product of the species richness and the source contribution.
The CMB receptor model [8,9] is composed of a set of linear equations, which indicates that the receptor concentration of each chemical element is equal to the linear sum of the product of the element content and the source contribution concentration.The basic principle of CMB model is mass conservation.It is assumed that there are several sources (J) that contribute to atmospheric particulates in the receptor, and that: (1) compositions of source emissions are constant over the period of ambient and source sampling; (2) the number of sources or source categories is less than or equal to the number of species; (3) the chemical composition of the particulate matter emitted by the various sources is significantly different; (4) the chemical composition of the particulate matter emitted by the source class is relatively stable; (5) all sources that make an obvious contribution to the receptors have their respective emission characteristics; (6) there is no interaction between the particles emitted by the source class, so the change in the process of transmission can be ignored; and (7) measurement uncertainties are random, uncorrelated, and normally distributed.Then the total substance concentration measured on the receptor is the linear sum of the contribution of each source.
At present, the most commonly used algorithm for solving CMB model is the EVWLS method [17], which is derived by minimizing the weighted sums of the squares of the differences between the measured and the calculated values of C i and F ij , and is a practical method for calculating the contribution of the source S j and the error σ S j : where the effective variance is , σ S j (µgm −3 or g/g) is the uncertainty in source contribution S j (µgm −3 or g/g), σ C i (µgm −3 or g/g) is the uncertainty in the ambient concentrations species i, and σ F ij is the uncertainty in the fraction of species i in the source j profile.The EVWLS method is actually an improvement over the ordinary weighted least squares method to minimize the sum of squares of the differences between the weighted chemical composition measurements and the calculated values.
However, there are some weaknesses to the above algorithms, such as near collinear sources resulting in incorrect source contributions, and the requirement that the number of chemical species be greater than or equal to the number of sources.At the same time, most of the above algorithms are finally transformed into optimization algorithms, which are mostly NP (Non-deterministic Polynomial) problems.So, in general, we get a locally optimal value or suboptimal value instead of a globally optimal value.So, instability is a fatal drawback to these algorithms, that is to say that different runs of the same input dataset at different times using the same algorithm may produce very different outputs or exhibit high variance with the same diagnostic criteria.
The Monte Carlo method [18], also known as stochastic simulation or statistical experiments, is based on statistical theory, according to the law of large numbers, using computer simulation technology [19] to solve some practical problem that is difficult to figure out directly with mathematical or other methods.The Monte Carlo method uses computer programs and mathematical models [20] to simulate practical random phenomena, through simulation experiments to get experimental data, and then infers from the analysis to get the law of certain phenomena.Monte Carlo simulation [19] is a method for exploring the solution and sensitivity of a complex system by varying the parameters within the statistical constraints.It is widely used in many fields such as engineering [21], environmental science [22], statistical physics [23], biophysics [24], materials science [25], and financial engineering [26].Many practical problems are often accompanied by many random factors.If we take these factors into account, the model will become too complex to solve.However, we can utilize the Monte Carlo method to generate a random number to simulate these complicated phenomena, and then find out the operation law.The validity of the Monte Carlo method relies on the sampling process in simulation.However, the simple Monte Carlo algorithm converges too slowly, and it is easy to converge to local extreme points.
In this paper, we explore a novel robust method for solving CMB receptor model based on enhanced sampling Monte Carlo simulation, which overcomes the shortcomings of the above algorithms.In other words, when collinearity exists in the source profiles or the number of source profiles is greater than the number of species, the ESMC-CMB (Enhanced Sampling Monte Carlo CMB) algorithm can come to the correct results for source apportionment.In general, these enhanced sampling methods can be employed to help us quickly find an optimal stable solution when the model is complex, nonlinear, or involves more than just a couple uncertain parameters.
This paper is organized as follows.Section 2 provides a literature review about the CMB model and enhanced sampling Monte Carlo simulation.In Section 3, the proposed enhanced sampling Monte Carlo CMB algorithm (ESMC-CMB) is described.Section 4 presents the related numerical experiments and a comparison with various traditional algorithms.Finally, conclusions are given in Section 5.

CMB Model and Enhanced Sampling Monte Carlo Simulation
Methods commonly used for the particulate source apportionment include receptor model, source emission inventory, and source dispersion models.The source emission inventory method determines its contribution rate by investigating and accounting for emission factors and activity levels for different source categories.The source dispersion model is a combination of meteorological conditions, emission sources, and chemical processes to assess the distribution and contribution of different source classes in three dimensions [27].The receptor model is a commonly used model in source apportionment.
In general, due to source j with constant emission rate E j , the source contribution S j present at a receptor during a sampling period of length T is where: D j is a dispersion factor depending on atmospheric stability (σ), wind velocity (u) and the location of source j with respect to the receptor (x j ).All parameters in Equation (2) vary with time, so the instantaneous D j must be integrated over time period T [27].
The CMB receptor model consists of a solution of a linear equation that represents the chemical concentration of each receptor as the product of source profile abundance and source contribution.Resource profile abundances (i.e., mass fractions of certain chemicals or other properties emitted from each source) and receptor concentrations (estimated with appropriate uncertainties) are used as input data for CMB.In order to distinguish the contribution of source types, the measured chemical and physical properties must occur in different proportions of source emissions, and the changes of these proportions between source and recipient can be neglected or approximated.The CMB model calculates the contribution values of each source and the uncertainties of these values.The principle of the CMB receptor model is shown in Figure 1.
proportions between source and recipient can be neglected or approximated.The CMB model calculates the contribution values of each source and the uncertainties of these values.The principle of the CMB receptor model is shown in Figure 1.The receptor model was used to identify the source of the receptor and determine the quantitative contribution of various sources to the receptor by analyzing the chemical tracers of the source of the environmental samples and the emission sources.If there is no interaction between their emissions to cause mass removal, the total mass measured at the receptor C is a linear sum of the contributions of the individual sources j S : Similarly, the mass concentration of elemental component i, Ci, will be where Fij is the mass fraction of source contribution Sj composed of element i at the receptor.The number of chemical species (I) must be greater than or equal to the number of sources (J) for a unique solution to these equations.Equations ( 4) and ( 5) are based on material immortality and mass conservation.In Equation ( 5), Ci and Sj are the inputs to the model, and Fij is the source contribution we need to calculate.
There are several methods to solve the CMB receptor models: (1) the tracer element method [28]; (2) an ordinary weighted least squares solution [28]; (3) a linear programming solution [29], which maximizes the sum of the source contributions; (4) a ridge regression weighted least squares solution with or without an intercept [30] that is one approach for handling the multi-collinearity; (5) a neural networks solution; and (6) an EVWLS solution, which is the most common algorithm.
At present, the most commonly used algorithm to solve the CMB model is the effective variance least squares method, because this method is a practical method to calculate the error S j σ of source contribution j S .The effective variance least squares method is actually an improvement on the ordinary weighted least squares method, which minimizes the sum of squares of the difference between the measured and calculated values of the weighted chemical components: The receptor model was used to identify the source of the receptor and determine the quantitative contribution of various sources to the receptor by analyzing the chemical tracers of the source of the environmental samples and the emission sources.If there is no interaction between their emissions to cause mass removal, the total mass measured at the receptor C is a linear sum of the contributions of the individual sources S j : Similarly, the mass concentration of elemental component i, C i , will be where F ij is the mass fraction of source contribution S j composed of element i at the receptor.
The number of chemical species (I) must be greater than or equal to the number of sources (J) for a unique solution to these equations.Equations ( 4) and ( 5) are based on material immortality and mass conservation.In Equation ( 5), C i and S j are the inputs to the model, and F ij is the source contribution we need to calculate.
There are several methods to solve the CMB receptor models: (1) the tracer element method [28]; (2) an ordinary weighted least squares solution [28]; (3) a linear programming solution [29], which maximizes the sum of the source contributions; (4) a ridge regression weighted least squares solution with or without an intercept [30] that is one approach for handling the multi-collinearity; (5) a neural networks solution; and (6) an EVWLS solution, which is the most common algorithm.
At present, the most commonly used algorithm to solve the CMB model is the effective variance least squares method, because this method is a practical method to calculate the error σ S j of source contribution S j .The effective variance least squares method is actually an improvement on the ordinary weighted least squares method, which minimizes the sum of squares of the difference between the measured and calculated values of the weighted chemical components: where the effective variance is , σ S j (µgm −3 or g/g) is the uncertainty in source contribution S j (µgm −3 or g/g), σ C i (µgm −3 or g/g) is the uncertainty (i.e., measurement errors) in the ambient concentrations species i, and σ F ij is the uncertainty (i.e., measurement errors) in the fraction of species i in the source j profile.The matrix form of the CMB model is as follows: The steps of EVWLS iterative algorithm for solving the CMB model (Equation ( 7)) are as follows: 1.
Set the initial estimate of the source contributions equal to zero: Calculate the diagonal components V e f f ,i of the effective variance matrix.All off-diagonal components of this matrix are equal to zero: 3.
Calculate the K + 1 value of S j : If the result of Equation ( 10) is greater than 1%, the previous iteration is executed; if less than 1%, the iteration is terminated.

5.
Calculate the value of σ S j in the K + 1 step iteration, then where T is a column vector with S j as the jth component; F is an I × J matrix of F ij , the source composition matrix; σ ci is one standard deviation uncertainty of the C i measurement; σ F ij is one standard deviation uncertainty of the F ij measurement; and V e is diagonal matrix of effective variances.
The above algorithm shows that the input parameters of the model are: the measured values of the concentration spectrum of the chemical components of the receptor C i and the standard deviation σ C i of C i , the measured values F ij of the source chemical composition spectrum and the standard deviation The output parameters of the model are: the calculated source contribution values of S j and the standard deviation σ S j of S j , the calculated source contribution values of chemical composition S ij , and the standard deviation σ S ij of S ij .
In the actual work of source apportionment, there are two commonly used software tools, EPA-CMB8.2(V8.2, EPA, Washington, USA, 2004) and NKCMB1.0(V1.0, Nankai University, Tianjin, China, 2005), which are the concrete implementation of above effective variance least squares algorithm for solving the CMB model.
The CMB receptor model is one of the standard methods used by the U.S. Environmental Protection Agency (EPA) to assess air quality.The practical tool software EPA-CMB8.2based on the CMB model and the effective variance least squares algorithm is recommended by the EPA.NKCMB1.0 is a practical software tool for PM2.5 source apportionment, developed by the Key Laboratory of Urban Air Particulate Pollution Prevention and Control, Nankai University, Tianjin China, based on the CMB receptor mathematical model and the corresponding effective variance least squares algorithm.NKCMB1.0 is more suitable for source analysis and calculation in China's more complex air quality environment.
As a stochastic method, Monte Carlo modeling can be used to describe and analyze complex problems by computer simulation sampling based on probability theory combined with certain statistical methods.Although the method emerged in the 1940s, it was limited to defense-related nuclear technology because it required sufficient computing resources to analyze the neutron behavior in matter [20].With the rapid development of high-speed computers, the Monte Carlo simulation method is more and more widely used [19,20].
The basic idea of the Monte Carlo method is to establish an appropriate probability model or stochastic process so that its parameters (such as the probability of events, the mathematical expectation of random variables) are equal to the solution of the problem.Then repeated random sampling test of the model or process are carried out.With the statistical analysis to the results, the final calculation of the parameters, the approximate solution is obtained.
For example, in a Monte Carlo Simulation problem we represent the quantity we want to know as the expected value of a random variable Y, such as µ = E(Y).Then we generate values Y 1 , • • • , Y n randomly and independently from the distribution of Y and get their average: as the estimate of µ.However, the convergence speed of the above simple sampling Monte Carlo method is too slow; for a large dimension sampling space, the time to complete the sampling calculation is intolerable.This paper will explore a new, enhanced sampling method to accelerate the convergence of the algorithm from the following aspects.
Firstly, in the process of solving the receptor CMB model, if the diagnostic indicator S j /C < λ, the results did not meet the requirements.So we could sample in the S j /C ≥ λ, for which the dimensions of the sample space will be reduced to some extent, and in the following experiment, λ = 0.7 will be selected.In the new sampling space, the Gibbs sampling method will be used.
Gibbs sampling [31-33] or a Gibbs sampler is a MCMC (Markov chain Monte Carlo) algorithm for obtaining a sequence of observations that are approximated from a specified multivariate probability distribution.Like other MCMC algorithms, Gibbs sampling from Markov chain can be regarded as a special case of the Metropolis-Hastings algorithm; its sampling distribution can be deduced from the properties of the Markov chain and probability transition matrix, and it finally converges to joint distribution.The name of the algorithm originated from Josiah Willard Gibbs and was proposed by brothers Stewart and Donald Gemman in 1984 [31][32][33].Gibbs sampling is suitable for multivariate distribution, where conditional distribution is easier to sample than edge distribution.At the same time, in order to accelerate the convergence speed of the simulation process, in this paper we adopt the enhanced Gibbs sampling algorithm from [34], called the enhanced sampling algorithm for short.
In order to overcome the shortcomings of the effective variance algorithm for solving the CMB model, in this paper, the EVWLS (effective variance weighted least square) algorithm will be combined with the Monte Carlo simulation algorithm of enhanced sampling to obtain a novel robust ESMC-CMB algorithm for solving the CMB receptor model.The algorithm is programmed by using MATLAB (V8.5, Mathworks, Natick USA, 2015) and implemented through numerical experiments with a real background.By comparing with the results of EPA-CMB 8.2 and NKCMB 1.0, the accuracy, robustness, and superiority of ESMC-CMB algorithm are fully verified.

Solving CMB Model Based on Enhanced Sampling Monte Carlo Simulation
For the CMB model with consideration of random error: where C i is the ambient concentration of species i, S j is the source contribution of source j, F ij is the fraction of species i in source j, ε i is for errors.The number of chemical species (I) must be equal to or greater than the number of sources (J) for a unique solution to these equations.Equation ( 13) is solved by an effective variance weighted least squares approach: minimizing χ 2 , where In the CMB model, uncertainties in the source contribution are estimated as where σ S j (µgm −3 or g/g) is the uncertainty in source contribution S j (µgm −3 or g/g), σ C i (µgm −3 or g/g) is the uncertainty in the ambient concentrations species i, and σ F ij is the uncertainty in the fraction of species i in the source j profile.Uncertainties in input variables are propagated by inversely weighting the EV (effective variance).
In this paper a new method for solving CMB receptor model based on the enhanced sampling Monte Carlo simulation was proposed as follows: Generate random inputs : S j with Enhanced Gibbs sampler Then we can get the following ESMC-CMB algorithm: Algorithm ESMC-CMB: Given the initial receptor and source profile data the number of source and receptor components I, the number of source J, obj = 10 100 , the number of simulation times N, n = 0.
Step 1: Generate random variables with the enhanced sampling Monte Carlo method proposed in this paper: Step 2: If J ∑ j=1 S j ≥ C, go to step 1.
Step 4: if n < N then step 1.

Application to a Realistic Case
This realistic case focuses on the dataset from a city in China.The profiles of the receptor and source component are shown in Tables 1 and 2. EPA-CMB8.2and NKCMB1.0software can be used to solve the CMB model when the number of sources or source categories is less than or equal to the number of species.So, firstly, we select nine sources (Soil Dust, Construction Dust, Coal Combustion, Cooking Smoke, Biomass Burning, Industrial Processes, NO 3 − , SO 4 2− , Vehicular Emissions) and nine components (Al, Si, K, Ca, Fe, OC, EC, NO 3 , SO 4 ), and use EPA-CMB8.2and NKCMB1.0 to calculate source apportionment with the data in Tables 1  and 2; the results are shown in Figures 2 and 3.
Processes 2019, 7, x FOR PEER REVIEW 10 of 15 EPA-CMB8.2and NKCMB1.0software can be used to solve the CMB model when the number of sources or source categories is less than or equal to the number of species.So, firstly, we select nine sources (Soil Dust, Construction Dust, Coal Combustion, Cooking Smoke, Biomass Burning, Industrial Processes, NO₃ − , SO₄ 2− , Vehicular Emissions) and nine components (Al, Si, K, Ca, Fe, OC, EC, NO3, SO4), and use EPA-CMB8.2and NKCMB1.0 to calculate source apportionment with the data in Tables 1 and 2; the results are shown in Figures 2 and 3.With the same selection of the source profiles and receptor components and the same dataset, we use our proposed ESMC-CMB algorithm to calculate the source apportionment, and the results are shown in Figure 4. Table 3 shows the numerical comparison of the contribution rates of the above three algorithms.With the same selection of the source profiles and receptor components and the same dataset, we use our proposed ESMC-CMB algorithm to calculate the source apportionment, and the results are shown in Figure 4. Table 3 shows the numerical comparison of the contribution rates of the above three algorithms.From the results of Figures 2-4 and Table 3, we can see that the results of source apportionment calculated with the above three algorithms are very close, and the correctness of the ESMC-CMB algorithm is verified.
If eight species such as Al, Si, K, Ca, Fe, OC, EC, and NO 3 − are selected, then the software EPA-CMB8.2and NKCMB1.0cannot be used because the number of species is less than the number of sources, but the proposed algorithm ESMC-CBM can calculate the results in Figure 5.As there is strong collinearity between the sources Raise Dust (RD) and Soil Dust, if RD is added to the source profiles (Soil Dust, Construction Dust, Coal Combustion, Cooking Smoke, Biomass Burning, Industrial Processes, NO 3 − , SO 4 2− , Vehicular Emissions) to participate in the calculation using EPA-CMB8.2and NKCMB1.0,some values of source contribution will be negative, so correct results cannot be obtained.However, using our proposed ESMC-CMB algorithm, we can get the correct value of the source apportionment as shown in Figure 6.A comparison of the above results is given in Table 4.As can be seen clearly from Table 4, in the practical instances for source apportionment, when nine species and nine sources, with no collinearity among them, are selected, EPA-CMB8.2,NKCMB1.0, and ESMC-CMB can obtain similar results.However, because there is strong collinearity between source Raise Dust (RD) and Soil Dust, when the source Raise Dust is added to the source profiles, or nine sources and eight species are selected, EPA-CMB8.2and NKCMB1.0cannot solve the model, but the proposed ESMC-CMB algorithm can come to a satisfactory results, which fully verify the robustness and effectiveness of ESMC-CMB.A comparison of the above results is given in Table 4.As can be seen clearly from Table 4, in the practical instances for source apportionment, when nine species and nine sources, with no collinearity among them, are selected, EPA-CMB8.2,NKCMB1.0, and ESMC-CMB can obtain similar results.However, because there is strong collinearity between source Raise Dust (RD) and Soil Dust, when the source Raise Dust is added to the source profiles, or nine sources and eight species are selected, EPA-CMB8.2and NKCMB1.0cannot solve the model, but the proposed ESMC-CMB algorithm can come to a satisfactory results, which fully verify the robustness and effectiveness of ESMC-CMB.

Having results
The collinearity exist in sources No results

No results
Having results

Conclusions
In this paper, a new robust algorithm for a CMB receptor model based on enhanced sampling Monte Carlo simulation and the effective variance weighted least squares is proposed.Because of the weaknesses of the traditional algorithms and software for CMB receptor source apportionment model such as collinearity and the requirement that the number of chemical species be greater than or equal to the number of sources, in many cases, software such as EPA-CMB8.2and NKCMB1.0cannot obtain results for the source apportionment or some values of the source contribution are negative.However, the proposed robust novel ESMC-CMB algorithm can overcome the above weaknesses and achieve satisfactory results.In the realistic source apportionment experiments, firstly, we selected nine sources (Soil Dust, Construction Dust, Coal Combustion, Cooking Smoke, Biomass Burning, Industrial Processes, NO 3 − , SO 4 2− , Vehicular Emissions) with no collinearity among them and nine species (Al, Si, K, Ca, Fe, OC, EC, NO 3 , SO 4 ), and used the EPA-CMB8.2,NKCMB1.0, and ESMC-CMB algorithms to calculate source contributions, and got similar results, but when we selected eight species and nine sources or added Raise Dust to the source profiles, because of the collinearity with Soil Dust, EPA-CMB8.2and NKCMB1.0 could not obtain correct results; however, the proposed ESMC-CMB algorithm can calculate the right results for source apportionment.This has fully demonstrated the robustness and effectiveness of the ESMC-CMB algorithm.Although the ESMC-CMB algorithm has many advantages, there is often missing data in the actual problem.How to further improve the ESMC-CMB algorithm in the case of missing data is the next area of research to tackle.
Due to the limitations of the CMB model, in the realistic study of air pollution, the results of source analysis from the ESMC-CMB algorithm should be referred to the calculation results of other models, such as PMF (Positive Matrix Factorization) and CMAQ (Community Multiscale Air Quality), to obtain more reasonable results.

Figure 1 .
Figure 1.The principle of the Chemical Mass Balance (CMB) receptor model.

Figure 1 .
Figure 1.The principle of the Chemical Mass Balance (CMB) receptor model.

Figure 2 .
Figure 2. The result using EPA-CMB8.2with nine sources and nine species.

Figure 2 .
Figure 2. The result using EPA-CMB8.2with nine sources and nine species.

Figure 2 .
Figure 2. The result using EPA-CMB8.2with nine sources and nine species.

Figure 3 .
Figure 3.The result using NKCMB1.0 with nine sources and nine species.

Figure 3 .
Figure 3.The result using NKCMB1.0 with nine sources and nine species.

Figure 4 .
Figure 4.The result using proposed ESMC-CMB algorithm (with nine sources and nine species.

Figure 4 .
Figure 4.The result using proposed ESMC-CMB algorithm (with nine sources and nine species.

Figure 5 .
Figure 5.The result using ESMC-CMB with nine sources and eight species.

Figure 5 .
Figure 5.The result using ESMC-CMB with nine sources and eight species.

Figure 5 .
Figure 5.The result using ESMC-CMB with nine sources and eight species.

Figure 6 .
Figure 6.The result using ESMC-CMB with 10 sources and nine species including RD (Raise Dust) collinear with Soil Dust.

Figure 6 .
Figure 6.The result using ESMC-CMB with 10 sources and nine species including RD (Raise Dust) collinear with Soil Dust.