The Gray-Box Based Modeling Approach Integrating Both Mechanism-Model and Data-Model: The Case of Atmospheric Contaminant Dispersion

Symmetry 2020, 12, 254

Introduction
Modeling and simulation has become one of the most important means of understanding and changing the target world, owing to its significant contribution to system analysis [1,2]. Driven by urgent engineering needs and recent advances in information technology, modeling and simulation has made great progress in recent years and has been successfully applied in various fields, including engineering (such as aerospace and manufacturing), society, economy, and the military. In the modeling and simulation process, accurate modeling underlies the subsequent simulation, analysis, and prediction of the system. In other words, the accuracy of system prediction is bounded by the accuracy of the model. It is widely accepted that the choice of modeling technique depends on the target object itself and on the available research conditions [2].
For most engineering systems (e.g., electronic systems), the mechanism and structure are often explicit [3]. Therefore, a white-box model can be established from prior knowledge of the system mechanism. For instance, a resistance-inductance-capacitance (RLC) circuit can be modeled with basic circuit laws (e.g., Kirchhoff's laws) and described by differential equations. Generally, the system mechanism includes the structure and the causality of the system. That is to say, the relationship among the input, the system structure, and the output can be obtained from this knowledge (e.g., an event or action is a direct consequence of another). This property makes white-box modeling suitable for the structural and diagnostic analysis of a system [2]. However, it is usually impractical to describe a target system entirely by its mechanism, because the running process of most real-world systems (especially non-engineering systems) is not clear enough [4]. In practice, exact prior knowledge is hard to acquire for many systems (e.g., models in social and economic fields) [5]. Fortunately, data-modeling provides an alternative that bypasses the system mechanism. With the development of big data in recent years, many researchers have used data-modeling methods in various fields (including science, engineering, economics, and industry) to predict the future behavior of the target system [6,7]. However, the data-based black-box model also has deficiencies. Firstly, compared with mechanism-based modeling, a black-box model established by data-modeling describes only the correlation in the data (e.g., the case of "beer and diapers" in the market) [8]. As a result, the relationship between the input and the output of a black-box model is hardly interpretable. Secondly, the data-modeling approach builds a model that relies entirely on the data, which means that it cannot cope with anomalies and changing circumstances of the system [9]. Finally, a pure data-modeling approach is limited in accuracy when the raw data cannot provide complete information about the system. In summary, each modeling approach is suitable for the system-specific modeling tasks mentioned above. However, each has limitations when facing complex real systems, e.g., wildfire spread [10] and traffic simulation [11].
To mitigate the disadvantages of a single white-box or black-box approach, a new gray-box modeling methodology integrating both the mechanism model and the data model is proposed, which aims to guide the construction of models that describe gray-box systems well. Significantly, when facing different target systems, suitable mechanism knowledge and different statistical and machine learning techniques can be applied in the gray-box modeling process. To demonstrate how the mechanism model and the data model are integrated in specific scenarios, this paper takes the modeling of air contaminant dispersion (ACD) as an example [12]. ACD is a typical gray-box system: only partial knowledge about it (e.g., the Gaussian dispersion model) is available, so a fully precise predictive model cannot be built [13,14]. Currently, there are several theoretical bases for describing the mechanism of atmospheric dispersion, such as computational fluid dynamics (CFD) based on the advection-diffusion equations [15,16], the Lagrangian Stochastic (LS) model underlain by the random walk [17,18], and the Gaussian model based on statistical theory [19,20]. However, conventional mechanism-based models are usually static while running, because their parameters are fixed even though the real atmospheric environment changes dynamically with time. To deal with ACD modeling in a dynamic meteorological environment, some investigators have employed data assimilation to build dynamic data-driven atmospheric dispersion models, which combine the mechanism model with dynamic observations [21-23]. Zheng et al. [23] took an assumed leakage accident at the Daya Bay nuclear power plant as a case. In the simulated case, the observations of several monitoring stations are assimilated into a Monte Carlo dispersion model by the ensemble Kalman filter (EnKF). The results illustrate that the ACD model with data assimilation outperforms the conventional ACD model in prediction accuracy. Reddy et al. [22] built a dimension-variant atmospheric dispersion model by integrating the RIMPUFF model and a particle filter. The experimental results show that the proposed model predicts more accurately than both the conventional model and the extended Kalman filter based model. This research shows that the gray-box modeling approach, which integrates data assimilation (black-box) with a mechanism (white-box) model, is an effective way to model atmospheric dispersion in a dynamic meteorological environment. Improved data availability has drawn more researchers to the data-modeling approach. Among the data-modeling approaches, machine learning is widely applied to the prediction of air contaminant dispersion. Support Vector Machine (SVM) is a typical machine learning approach [24-26]. Ma et al. [27] compared and tested several machine learning approaches, including SVM, for ACD modeling on the well-known field dataset Project Prairie Grass. Yeganeh et al. [26] combined SVM with partial least squares (PLS) to forecast the CO concentration in the region of Teheran. The proposed gray-box model achieves more accurate prediction as well as higher computational efficiency than a single machine learning technique (black-box). We expect our study to provide a new possibility for modeling and simulation.
The rest of this paper is organized as follows. Section 2 presents the principle of the gray-box based modeling approach and two gray-box modeling methods for the ACD process. Two experimental cases combining data and mechanism are elaborated in Sections 3 and 4, respectively. The results show that the two proposed approaches outperform conventional ones in model accuracy. Finally, conclusions and future work are presented in Section 5.

The Gray-Box Based Modeling Approach
Both mechanism-modeling and data-modeling have their own advantages and limitations. Table 1 summarizes the general characteristics of the two symmetric modeling approaches from different perspectives. From the point of view of model representation and structure, the mechanism-modeling approach builds a dynamic map from the inputs and states to the output variables based on cause and effect, and it therefore requires knowledge of the target system. Conversely, the data-modeling approach merely associates the input variables with the output variables in the form of a static map. As a result, researchers can build a data model without any prior knowledge. The former approach has been pursued in research focusing on clear causality, such as physical and operational laws, while the latter has been studied in the area of intelligent techniques, such as artificial neural networks (ANN) and support vector regression (SVR).

Table 1. Generalized characteristics of the two modeling approaches.
Item | Mechanism-modeling | Data-modeling
Representation | | Associational relationship between variables
Structure of the model | System knowledge required; dynamic map of (input, state) to output | No knowledge about system required; static map of input to output
The mechanism-modeling approach, which is based on knowledge, enables not only prediction under conditions different from those seen in training, but also control and planning of the system. On the other hand, the data-modeling approach, which is built on learned information, can describe the past and predict the future only under the same conditions as in training. This difference allows the mechanism-modeling approach to analyze abnormal or not-yet-existing systems, for example, a rare event or a new design, which is impossible with the data-modeling approach. However, because a real system is highly complex and influenced by various factors, its mechanism can rarely be obtained in full.
From the comparison of the mechanism-modeling (white-box) and data-modeling (black-box) approaches and their limitations discussed above, it is clear that a complex system is difficult to describe using a single approach. Consequently, a new modeling methodology is needed that combines the advantages of both and enhances performance. It is also important to identify how the mechanism (white-box) model and the data (black-box) model are constructed from prior knowledge and observation data, respectively, and how the two models complement each other.
Focusing on the topic of system modeling, this paper first compares the conventional mechanism-modeling and data-modeling approaches, and then proposes a gray-box modeling approach that overcomes their limitations by combining the two. The modeling framework is depicted in Figure 1. White-box modeling based on the system mechanism can only model the system under ideal experimental conditions (i.e., when the system mechanism is fully available). Black-box modeling is based on data, which can only describe the correlation among system components and ignores causality. Therefore, a modeling approach for gray-box systems should be established that integrates the mechanism with dynamic, multi-granularity big data. In this paper, atmospheric dispersion is studied as a typical gray-box system, and two modeling approaches for this gray-box system are proposed.

Dynamic Data Driven Atmospheric Dispersion Modeling Method (from White-Box to Gray-Box)
The prediction performance of a traditional atmospheric dispersion mechanism model depends on the correct setting of its parameters. However, ACD is affected by many factors, and these parameters usually change dynamically in a field case. It is therefore difficult to observe them accurately in real time, especially in dynamic meteorological environments, and a purely mechanism-based model fails to give sufficiently accurate predictions. As a result, it is necessary to model the ACD process with a gray-box modeling approach.
To address this problem, data assimilation is applied to introduce dynamic observations into the Gaussian multi-puffs model. The dynamic observations help to correct and update the model parameters, and consequently improve the prediction accuracy for the gray-box system. The idea of gray-box modeling using data assimilation is shown in Figure 2.

Gaussian Dispersion Model
The Gaussian dispersion model is widely used in the modeling of atmospheric dispersion, and in practice several ACD models are built on it. Because its mechanism is relatively simple, the model predicts the ACD process quickly. Taking the instantaneous release of a point source as the case in this paper, the Gaussian puff model can be derived from statistical theory as Equation (1):

$$C(x, y, z, t) = \frac{Q}{(2\pi)^{3/2}\sigma_x \sigma_y \sigma_z} \exp\left[-\frac{(x - ut)^2}{2\sigma_x^2}\right] \exp\left[-\frac{y^2}{2\sigma_y^2}\right] \left\{\exp\left[-\frac{(z - H)^2}{2\sigma_z^2}\right] + \exp\left[-\frac{(z + H)^2}{2\sigma_z^2}\right]\right\} \tag{1}$$

where x, y, and z are the coordinates in the downwind, crosswind, and vertical directions, respectively; t is the time elapsed since the start of emission; H is the height of the source; Q is the total mass of atmospheric pollutants contained in the puff; and u is the average wind speed. σx, σy, and σz are the dispersion coefficients at different distances in the x, y, and z directions, respectively. These dispersion coefficients are usually determined by the atmospheric stability level, which is commonly classified by the Pasquill-Gifford-Turner method [28,29]. A set of empirical formulas for the dispersion coefficients is [30]:

$$\sigma_x = \sigma_y = a x^{b}, \qquad \sigma_z = c x^{d} \tag{2}$$

where a, b, c, d are influence factors. In the Gaussian puff model, the puff travels with the wind, and the concentration of atmospheric contaminants follows a Gaussian distribution in the x, y, and z directions. When the source releases continuously, the plume can be regarded as a superposition of several puffs released sequentially at small time intervals. Therefore, the concentration of atmospheric pollutants at a point of interest is equal to the superposition of all released puffs:

$$C(t, \mathbf{z}) = \sum_{j=1}^{n} f\left(t, \mathbf{z}, \mathbf{l}, q_j, t_{start} + (j - 1)\delta\right) \tag{3}$$

The function f(t, z, l, q, t_start) represents the dispersion process of a single puff. It calculates the concentration at time t at an observation point with position vector z, for a source whose release start time is t_start, whose position vector is l, and whose instantaneous release amount is q. q_j is the mass of pollutants released in the jth puff, which is equivalent to the release rate of the source; n is the number of puffs; and δ is the time interval between successive puff releases. With the function f, the concentration of atmospheric contaminants can be approximated in scenarios with a dynamic source release rate and a dynamic wind field.
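The puff formula and the superposition of sequentially released puffs can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: the function names, the power-law form of the dispersion coefficients, the choice to evaluate them at the distance travelled by the puff centre, and the coefficient values in the usage note are all assumptions made for the example.

```python
import math


def dispersion_coefficients(distance, a, b, c, d):
    """Power-law spreads: sigma_x = sigma_y = a*x**b, sigma_z = c*x**d (assumed form)."""
    distance = max(distance, 1e-6)  # keep the spreads strictly positive near the source
    sigma_xy = a * distance ** b
    sigma_z = c * distance ** d
    return sigma_xy, sigma_xy, sigma_z


def puff_concentration(x, y, z, elapsed, q, u, h, coeffs):
    """Concentration at (x, y, z) from one puff of mass q, `elapsed` seconds after release.

    The source sits at (0, 0, h) and the wind blows along +x with speed u.
    """
    if elapsed <= 0.0:
        return 0.0  # the puff has not been released yet
    centre = u * elapsed  # downwind position of the puff centre
    # Assumption: the spreads are evaluated at the distance travelled by the puff.
    sx, sy, sz = dispersion_coefficients(centre, *coeffs)
    norm = q / ((2.0 * math.pi) ** 1.5 * sx * sy * sz)
    along = math.exp(-((x - centre) ** 2) / (2.0 * sx ** 2))
    cross = math.exp(-(y ** 2) / (2.0 * sy ** 2))
    # The second exponential is the image source that models reflection at the ground.
    vertical = (math.exp(-((z - h) ** 2) / (2.0 * sz ** 2))
                + math.exp(-((z + h) ** 2) / (2.0 * sz ** 2)))
    return norm * along * cross * vertical


def multi_puff_concentration(x, y, z, t, masses, delta, u, h, coeffs, t_start=0.0):
    """Sum the contributions of puffs released every `delta` seconds from t_start."""
    total = 0.0
    for j, q_j in enumerate(masses):
        release_time = t_start + j * delta
        total += puff_concentration(x, y, z, t - release_time, q_j, u, h, coeffs)
    return total
```

For instance, `multi_puff_concentration(300.0, 0.0, 30.0, 120.0, [50.0] * 12, 10.0, 3.0, 50.0, (0.08, 0.9, 0.06, 0.85))` sums twelve 50 g puffs released 10 s apart; the four coefficients are placeholders, not values from the paper.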

Data Assimilation Framework Based on a Particle Filter
Based on the Gaussian multi-puffs model described above, the ACD process of a continuous point source can be described given accurate parameter settings. In practical applications, however, accurate and timely parameter values are often unavailable to the Gaussian model. For one thing, monitoring devices can only give the average value of a specific ACD parameter over a period of time; for example, only the average wind speed and wind direction can be obtained from meteorological stations. For another, the traditional Gaussian model is static, and its parameters remain unchanged during prediction. Therefore, when the local meteorological conditions change dynamically, it is difficult to model the real ACD process with a Gaussian multi-puffs model alone, because the mechanism-based model cannot adapt to the dynamic meteorological environment. To solve this problem, the particle filter is used as a data assimilation method to construct a dynamic data-driven Gaussian dispersion model [31]. In this way, the mechanism model is corrected by the dynamic monitoring data: the system state (the model parameters) is estimated from real-time monitoring data, so the model can be applied to the ACD process, which is a typical gray-box system.
The particle filter, also known as the Sequential Monte Carlo (SMC) method, uses Bayesian inference and random sampling to recursively estimate the state of a dynamic system from observed data [32,33]. The core idea is to use a set of weighted random samples (particles) to approximate the posterior probability density function of the system state. Since particle filtering can approximate arbitrary probability densities and makes few assumptions about the model, it is an effective method for data assimilation in complex systems. The basic particle filtering algorithm consists of four repeated steps: initialization, importance sampling, weight update, and resampling. To apply particle filtering to data assimilation based on the Gaussian multi-puffs model, the dispersion of atmospheric contaminants must first be cast as a state space model. In general, a dynamic system can be described by a discrete state space model [34], consisting of a state transition model (Equation (4)) and an observation model (Equation (5)):

$$s_{t+1} = f(s_t, \gamma_t) \tag{4}$$

$$m_t = g(s_t, \omega_t) \tag{5}$$

where s_t and m_t represent the system state and the observation at time t, respectively. The function f in the state transition model describes the evolution of the system state over time, while the function g in the observation model defines the relationship between the system state and the observed values. γ and ω are independent random variables describing the system state noise and the observation noise, respectively. Here, the influence coefficients a, b, c, d in the Gaussian multi-puffs model are selected as the system state. They are determined by atmospheric stability, which is affected by meteorological conditions such as solar radiation intensity, wind field, and cloud cover, so they are difficult to predict accurately in real time. At the same time, the influence coefficients determine the Gaussian distribution followed by the concentration of atmospheric contaminants, which is a further reason for selecting them as the system state. As for the state transition equation, the atmospheric stability level usually changes slowly in the actual environment, and its law of change is unclear and difficult to express with mathematical equations. Therefore, we choose the identity function as the state transition function f and use the state noise γ (set as Gaussian noise in this paper) to realize the transition and evolution of the system state at each time step. In the observation model, since the ACD process is a dynamic system, the relationship between the system state and the observation (the ACD concentration value) is built through the Gaussian multi-puffs model (as the function g).
The observation noise, set as Gaussian white noise in the observation model, usually originates from the observation device itself in reality. By applying dynamic observation data, a gray-box model combining data and mechanism through the particle filtering technique is established. It is suitable for modeling the atmospheric dispersion process (the gray-box system mentioned above) in a dynamic environment.
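The four steps of the particle filter can be sketched as follows. This is a minimal bootstrap filter under stated assumptions: the transition is the identity plus Gaussian state noise, the observation noise is Gaussian, resampling is multinomial, and the observation function g is passed in by the caller (in the setting described above it would be the Gaussian multi-puffs model). All function and parameter names are illustrative, and the sketch is not the authors' code.

```python
import math
import random


def initialize_particles(n, lower, upper, rng):
    """Initialization: draw n state vectors uniformly inside per-dimension bounds."""
    return [[rng.uniform(lo, hi) for lo, hi in zip(lower, upper)] for _ in range(n)]


def assimilate(particles, observation, g, state_std, obs_std, rng):
    """Run one assimilation cycle and return the resampled particles.

    particles   : list of state vectors (e.g. [a, b, c, d])
    observation : list of measured values at the current time step
    g           : observation function mapping a state vector to predicted values
    """
    # Importance sampling: identity transition perturbed by Gaussian state noise.
    proposed = [[s + rng.gauss(0.0, sd) for s, sd in zip(p, state_std)]
                for p in particles]

    # Weight update: Gaussian log-likelihood of the observation for each particle.
    log_weights = []
    for p in proposed:
        predicted = g(p)
        sq_err = sum((m - y) ** 2 for m, y in zip(observation, predicted))
        log_weights.append(-0.5 * sq_err / obs_std ** 2)
    top = max(log_weights)  # subtract the maximum so the best particle has weight 1
    weights = [math.exp(lw - top) for lw in log_weights]

    # Resampling: draw particles with probability proportional to their weights.
    chosen = rng.choices(proposed, weights=weights, k=len(proposed))
    return [list(p) for p in chosen]


def posterior_mean(particles):
    """Point estimate of the state: the average of the equally weighted particles."""
    n = len(particles)
    dim = len(particles[0])
    return [sum(p[i] for p in particles) / n for i in range(dim)]
```

Calling `assimilate` once per monitoring time step, with `posterior_mean` read out after each call, gives the running estimate of the influence coefficients that is fed back into the dispersion model.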

Atmospheric Dispersion Modeling Method Based on Gaussian-Machine Learning (from Black-Box to Gray-Box)
Besides mechanism-modeling, data-modeling approaches for black-box systems, such as machine learning, are also widely used in atmospheric dispersion [24,35,36]. Unlike the mechanism model, a machine learning method builds a mapping between specific parameters and the dispersion concentration from a predefined training set [37]. The training sets are usually based on datasets from real experiments, e.g., the Prairie Grass dataset [38] and the Indianapolis dataset [39]. Meanwhile, the dispersion of atmospheric contaminants has long been studied, and many dispersion mechanisms have been proposed. Machine learning models that ignore mechanisms and prior knowledge often perform poorly because of the complex mechanisms of real atmospheric dispersion. Therefore, constructing a gray-box model by adding some mechanism knowledge to the data model is a feasible way to solve this problem.

Support Vector Regression
Commonly used machine learning methods for atmospheric dispersion modeling and prediction include ANN, SVR, and so on [27]. Here, SVR is used to construct the atmospheric dispersion model. Unlike the SVM model used in classification problems [40], SVR deals with regression problems. SVR fits a complex functional relationship by mapping the input data to a high-dimensional feature space and performing linear regression there. The remarkable characteristic of SVR is that its training goal is not to minimize the prediction error, but to make the model generalize better by minimizing a bound on the generalization error [40]. Given a training set (x_1, z_1), ..., (x_l, z_l), where x_i ∈ R^n is the input and z_i ∈ R^1 is the output, the standard form of SVR can be expressed as:

$$\min_{w, b, \xi, \xi^*} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i + C \sum_{i=1}^{l} \xi_i^*$$

$$\text{subject to} \quad w^T \phi(x_i) + b - z_i \le \varepsilon + \xi_i, \quad z_i - w^T \phi(x_i) - b \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0, \; i = 1, \ldots, l$$

where C is the regularization parameter, ε is the error tolerance, and ξ and ξ* are slack variables. The fitting function is:

$$y(x) = \sum_{i=1}^{l} (-\alpha_i + \alpha_i^*) K(x_i, x) + b$$

where K(x_i, x_j) = φ(x_i)^T φ(x_j) is the kernel function; the commonly used Radial Basis Function (RBF) is chosen as the kernel of SVR in this paper. α_i and α_i* are the Lagrange multipliers, which are nonzero only for the support vectors. In atmospheric dispersion modeling problems, x_i ∈ R^n represents the parameters related to atmospheric dispersion, and the output y(x) of SVR is the concentration at the points of interest. A series of parameters related to atmospheric dispersion are usually selected as input features, as shown in Table 2. The regularization parameter C and the expansion coefficient σ of the RBF function affect the complexity and performance of the model. To construct an optimal SVR model, the best combination of parameter values is selected according to model performance during model construction. In the construction of the SVR model, the input parameters are original observation parameters of atmospheric dispersion scenes. These parameters are easy to obtain in actual scenes, and their quantity and quality are also guaranteed. However, due to the complexity of the ACD process, the relationship between many observation parameters and the concentration at the points of interest is complex and difficult to describe. This complex input-output mapping makes the SVR model harder to train, which in turn affects its accuracy. To solve this problem, more effective input features should be constructed to reduce the training difficulty. Because ACD is affected by many factors, a highly accurate model is difficult to obtain. However, ACD is not a complete black-box system, so knowledge from mechanism models can be introduced into the feature construction of a data model such as SVR. As mentioned in Section 2, the Gaussian dispersion model, a classical atmospheric dispersion model, can effectively simulate the dispersion of atmospheric pollutants in many scenarios. Moreover, its form is simple and it is fast to compute. Therefore, knowledge of the Gaussian multi-puffs model is applied to the feature construction of the SVR model in this paper. Two terms of the Gaussian multi-puffs model, G_y and G_z, are selected and added to the input features of the SVR model. The expression is as follows:

$$G_y = \exp\left[-\frac{y^2}{2\sigma_y^2}\right], \qquad G_z = \exp\left[-\frac{(z - H)^2}{2\sigma_z^2}\right] + \exp\left[-\frac{(z + H)^2}{2\sigma_z^2}\right]$$

where σ_y and σ_z are the dispersion coefficients at different distances in the y and z directions. The Gaussian parameters described above combine many factors, such as wind speed, wind direction, downwind distance, crosswind distance, and atmospheric stability level. They are a direct and efficient way to describe the dispersion of atmospheric contaminants. Compared with the original observation parameters, the Gaussian parameters G_y and G_z are high-dimensional features, which can effectively reduce the complexity of the input-output mapping. Thus, they are used to construct the Gaussian-SVR model, a gray-box model for the prediction of ACD. The idea is shown in Figure 3.
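A hedged sketch of this feature construction is given below. It assumes that G_y and G_z are the crosswind and vertical terms of the Gaussian model and that the dispersion coefficients follow a power law in downwind distance; the function names, the raw feature list, and the coefficient values are illustrative only. The resulting vectors would then be passed to any SVR implementation.

```python
import math


def gaussian_parameters(y, z, h, sigma_y, sigma_z):
    """Return (G_y, G_z): crosswind and vertical Gaussian terms (assumed definition)."""
    g_y = math.exp(-(y ** 2) / (2.0 * sigma_y ** 2))
    g_z = (math.exp(-((z - h) ** 2) / (2.0 * sigma_z ** 2))
           + math.exp(-((z + h) ** 2) / (2.0 * sigma_z ** 2)))
    return g_y, g_z


def build_features(raw, downwind, crosswind, height, source_height, coeffs):
    """Append the mechanism-derived features G_y and G_z to the raw observations.

    raw    : list of original observation parameters (e.g. wind speed, release rate)
    coeffs : (a, b, c, d) with sigma_y = a*x**b and sigma_z = c*x**d (assumed form)
    """
    a, b, c, d = coeffs
    x = max(downwind, 1e-6)  # avoid a zero spread at the source
    sigma_y = a * x ** b
    sigma_z = c * x ** d
    g_y, g_z = gaussian_parameters(crosswind, height, source_height, sigma_y, sigma_z)
    return list(raw) + [g_y, g_z]
```

The design point is that the regression model no longer has to learn the exponential dependence on crosswind distance and height from raw coordinates; it receives that dependence as two ready-made inputs.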

Experimental Design
We constructed a simulated atmospheric pollutant emission and dispersion scenario in a commercial process hazard analysis software (PHAST) [41] to verify the prediction performance of the abovementioned dynamic data-driven Gaussian multi-puffs model in dynamic meteorological conditions. In this emission scenario, the study area is a square of 1000 m × 1000 m. The source is located at (0, 0, 50), from which puffs are released at an interval of 10 s throughout the simulation. The release rate of the source, expressed as the mass of atmospheric pollutant contained in each puff, is set to a random variable with a mean of 50 g and a standard deviation of 5 g (10% of the mean). The wind field parameters are modeled as Gaussian white noise. The wind speed obeys a Gaussian distribution with a mean of 3 m/s and a standard deviation of 0.3 m/s, while the wind direction obeys a Gaussian distribution with a mean of 220 degrees and a standard deviation of 10 degrees. In order to construct a dynamic meteorological condition, the atmospheric stability level [42,43] is set to change dynamically with time (shown in Table 3). The dynamic atmospheric stability level affects the influence coefficients in the Gaussian model. The influence coefficients change linearly within each of the three time periods 0-400 s, 400-800 s, and 800-1200 s. Using this simulation scenario, the dispersion of atmospheric contaminants at a height of 30 m is simulated based on a Gaussian multi-puffs model. The simulation time is set to 1200 s.
As shown in Table 4, control experiments are used to illustrate the modeling effect of the data assimilation model in dynamic meteorological conditions. The control group uses the traditional Gaussian multi-puffs model as the dispersion model, representing the white-box modeling approach using mechanisms. Moreover, fixed source terms, wind field parameters, and atmospheric stability level are used as estimates for the corresponding parameters in the actual environment. The mass of atmospheric pollutant contained in every puff is set to 50 g. The wind direction is 220 degrees and the wind speed is 3 m/s. The influence coefficients of the atmospheric stability level are set as in Table 3. In contrast, the dynamic data-driven Gaussian multi-puffs model is used in the experimental group. Driven by dynamic monitoring data, the influence coefficients of the model are corrected and estimated dynamically. The dynamic monitoring data are acquired by several virtual unmanned aerial vehicles (UAVs) flying along particular paths in the simulated dispersion scenario, as detailed in reference [44]. The source release rate and wind field settings in the experimental group are the same as in the control group.
The experiments are carried out as follows. Firstly, the emission and dispersion of atmospheric contaminants under dynamic meteorological conditions are simulated, and the monitoring concentration data are collected by the virtual UAVs. In the experimental group, with the monitoring data as input, the system state (the influence coefficients) is corrected and estimated dynamically in the framework of data assimilation, and the concentration prediction is obtained from the corrected model. In the control group, the dispersion of atmospheric contaminants is predicted directly by the traditional Gaussian multi-puffs model.
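The stochastic inputs of this scenario can be sketched as follows. This is only an illustration of the stated distributions (puff mass, wind speed, wind direction) and of a linear change of a coefficient within one time period; the function names, the random seed, the choice of 120 releases (one every 10 s starting at t = 0), and the coefficient endpoints are assumptions, since the Table 3 values are not reproduced here.

```python
import random

def sample_scenario(duration_s=1200, interval_s=10, seed=0):
    """Draw one realization of the release and wind field parameters.

    One puff is released every `interval_s` seconds; mass, wind speed and
    wind direction are drawn independently for each puff (white noise).
    """
    rng = random.Random(seed)
    puffs = []
    for t in range(0, duration_s, interval_s):
        puffs.append({
            "t": t,                                     # release time (s)
            "mass_g": rng.gauss(50.0, 5.0),             # mean 50 g, std 5 g
            "wind_speed": rng.gauss(3.0, 0.3),          # mean 3 m/s, std 0.3 m/s
            "wind_dir_deg": rng.gauss(220.0, 10.0) % 360.0,  # mean 220 deg, std 10 deg
        })
    return puffs

def linear_coefficient(t, t0, t1, v0, v1):
    """Linearly interpolate a coefficient between (t0, v0) and (t1, v1)."""
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

puffs = sample_scenario()
# Hypothetical endpoints for one coefficient in the 0-400 s period.
coef_at_200s = linear_coefficient(200, 0, 400, 1.0, 2.0)
```

The control group would replace every sampled value by its mean (50 g, 3 m/s, 220 degrees), which is what makes the comparison between the two groups a test of the dynamic correction alone.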

Experimental Results
The variations of the four influence coefficients during the experiments are shown in Figure 4. In the simulated dispersion scenario, the four coefficients change dynamically with time. In the control group, the traditional Gaussian multi-puffs model is static while running, so its parameters always stay at their initial values and gradually deviate from the values set in the simulated scenario. In contrast, the influence coefficients in the experimental group are corrected and updated dynamically due to the introduction of dynamic monitoring data. With the help of data assimilation, the influence coefficients in the experimental group stay close to the setting values, and their changes follow similar trends, as shown in Figure 4.
In order to compare the performance of the proposed model in concentration prediction, the errors at the observation point are calculated for the two groups of experiments. The results are shown in Figure 5. The figure shows that the root-mean-square error (RMSE) of the dynamic data-driven Gaussian multi-puffs model is significantly lower than that of the traditional Gaussian multi-puffs model. The error of the control group is due to the fixed influence coefficients in the model: because the actual influence coefficients change with time, the model error accumulates, resulting in a rising RMSE. In comparison, the experimental group maintains the accuracy of the model by correcting the state of the system with the help of dynamic monitoring data.
The effectiveness of the dynamic data-driven Gaussian multi-puffs model can also be validated by comparing the error distributions, which show the quality of the concentration prediction over the whole ACD area. As shown in Figure 6, the concentration prediction based on the dynamic data-driven Gaussian multi-puffs model is significantly better than that of the traditional Gaussian multi-puffs model in most areas. According to the discussion in Section 1, with the support of monitoring data, the dynamic data-driven Gaussian multi-puffs model is built by a gray-box modeling approach. This case demonstrates the better performance and effectiveness of the gray-box model.
Figure 7a shows that the prediction accuracy of the original-SVR model is unsatisfactory, especially for high concentration values (above 200 mg/m³), where the predictions are far lower than the experimental observations. Moreover, some negative values appear in the predictions of the original-SVR model, which is clearly inconsistent with the actual situation. In contrast, Figure 7b indicates that the predictions of the Gaussian-SVR model are closer to the experimental observations and are more accurate for high concentration values. In addition, Figures 7 and 8 both show that the negative values in the Gaussian-SVR predictions are clearly reduced.
Several model evaluation indexes are used to measure the performance of the prediction models: the squared correlation coefficient (R²), the fractional bias (FB), and the normalized mean square error (NMSE) [24,27,41]. These three indexes are calculated and shown in Figure 7. The R² of Gaussian-SVR (0.6598) is significantly higher than that of original-SVR (0.4652). Furthermore, the FB and NMSE of the Gaussian-SVR predictions (0.0553 and 0.3799) are lower than those of original-SVR (0.3565 and 0.9780). The fitting curve, which many researchers use to evaluate the overall accuracy of prediction data [45,48], is shown in Figure 8. The linear fitting curve of the Gaussian-SVR model is clearly closer to the line y = x than that of the original-SVR model, indicating that the predictions of the Gaussian-SVR (gray-box) model are more accurate.
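The evaluation indexes mentioned above can be computed as in the following minimal sketch. It assumes the definitions commonly used in dispersion model evaluation, with FB read as the fractional bias in the (observation minus prediction) sign convention, under which a positive FB indicates under-prediction; the exact definitions used for Figures 7 and 8 are not stated in the text, so the formulas and function names here are assumptions.

```python
import math

def mean(values):
    return sum(values) / len(values)

def r_squared(obs, pred):
    """Squared Pearson correlation coefficient between obs and pred."""
    mo, mp = mean(obs), mean(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)

def fractional_bias(obs, pred):
    """FB = 2 * (mean(obs) - mean(pred)) / (mean(obs) + mean(pred))."""
    mo, mp = mean(obs), mean(pred)
    return 2 * (mo - mp) / (mo + mp)

def nmse(obs, pred):
    """NMSE = mean((obs - pred)^2) / (mean(obs) * mean(pred))."""
    sq_err = [(o - p) ** 2 for o, p in zip(obs, pred)]
    return mean(sq_err) / (mean(obs) * mean(pred))

def rmse(obs, pred):
    """Root-mean-square error between obs and pred."""
    return math.sqrt(mean([(o - p) ** 2 for o, p in zip(obs, pred)]))

def linear_fit(obs, pred):
    """Least-squares line pred = slope * obs + intercept."""
    mo, mp = mean(obs), mean(pred)
    slope = (sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
             / sum((o - mo) ** 2 for o in obs))
    return slope, mp - slope * mo

# Toy data (not from the paper): predictions are exactly twice the observations.
obs = [1.0, 2.0, 3.0]
pred = [2.0, 4.0, 6.0]
slope, intercept = linear_fit(obs, pred)
```

A perfect model gives R² = 1, FB = 0, NMSE = 0, and a fitted line with slope 1 and intercept 0, which is why a fitted line closer to y = x is read as higher accuracy.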

From the comparison of the results above, it can be concluded that the prediction performance of the SVR model is significantly improved by introducing mechanism-model knowledge into its feature construction. This shows that the Gaussian features constructed from mechanism-model knowledge are more efficient than the original parameters: they effectively reduce the training difficulty and improve the accuracy of the model.

Conclusions and Expectations
Driven by the need for system analysis and prediction in human society, modeling and simulation techniques are increasingly used in many fields, including scientific research, engineering practice, political economy, and so on. As a result, researchers are confronted with high demands for accurate system modeling. Existing mechanism-modeling and data-modeling approaches perform well in describing white-box and black-box systems, respectively. However, most systems faced in practical scenarios are gray-box systems, for which both white-box modeling and black-box modeling have inevitable drawbacks and limitations. After comparing these two symmetric modeling approaches, we integrate them to obtain a gray-box modeling method that can describe gray-box systems well. The gray-box modeling approach not only considers the prior knowledge of the target system, but also applies intelligent statistical and machine learning techniques. Taking ACD as a typical gray-box system, this paper demonstrates two symmetric gray-box modeling methods that combine mechanism and data. For the problem that the mechanism (white-box) model has difficulty modeling atmospheric dispersion in a dynamic meteorological environment, dynamic monitoring data serve as inputs to the Gaussian multi-puffs model. Through the data assimilation method, the model parameters are corrected dynamically. As a result, the static Gaussian multi-puffs model can be adapted to the modeling of the dynamic dispersion of atmospheric contaminants. For the problem that the data (black-box) model of atmospheric dispersion prediction is not accurate enough, the knowledge of a mechanism model is introduced into the feature construction of an SVR model. The mechanism knowledge of the Gaussian model is used to reduce the difficulty of model training, thus improving the prediction accuracy of the model.
As our main contribution, we show the feasibility of a new symmetric modeling approach that is superior to using a mechanism (white-box) modeling approach or a data (black-box) modeling approach alone. However, the gray-box modeling approach is only a methodology. When facing different real systems, various kinds of prior knowledge and data models can be combined. Moreover, even for the same target system, different prior knowledge and data models can be integrated. In this situation, finding effective mechanism models and data models is a great challenge.
In the future, further research on the modeling of ACD in emergency management will be carried out to promote the integration of data and mechanism modeling approaches. This integration is reflected in two aspects. The first is the integration of multi-source data, which means obtaining more accurate meteorological and monitoring data from the observations of various monitoring resources. The second is the further integration of data models and mechanism models. For example, to address the poor accuracy of the Gaussian-SVR model in complex terrain, mechanism models (such as CFD) can be further combined to design input features for higher-precision modeling.

Figure 1. The Gray-box based Modeling Approach integrating both the Mechanism model and Data model.

Figure 2. From White-box based Modeling to Gray-box Modeling: The Case of Atmospheric Dispersion Modeling.

Figure 3. From Black-box based Modeling to Gray-box Modeling: The case of Atmospheric Dispersion Modeling in Source Estimation.

Figure 4. The comparisons of dispersion parameters in experiments. (a) Values of coefficient A; (b) Values of coefficient B; (c) Values of coefficient C; (d) Values of coefficient D.

Figure 5. The results of concentration prediction.

Figure 7. The results of SVR based Atmosphere Dispersion model: Gauss Model-SVR model.

Figure 8. Observation data and Prediction data fitting curve: Gauss Model-SVR model.

Table 1. Comparison of the mechanism (white-box) model and data (black-box) model.

 | Mechanism (White-Box) Model | Data (Black-Box) Model
 | | No knowledge about system required
 | (State Q within model) | Static map of input to output (No state within model)
Modeling means | Physical and/or operational laws | Intelligent techniques
Condition for valid prediction | Model validation | System structure remains unchanged before and after training
Anomaly/non-existing system | Applicable (as in rare event or new design) | Not applicable

Table 2. Common parameters in atmospheric dispersion.

Table 3. Values of atmospheric stability and influence coefficients in simulated dispersion scenarios.